Title: Bioinformatics Datamining Bacterial Genome Sequences
1 Bioinformatics: Data-mining Bacterial Genome Sequences
- Professor Mark Pallen
- University of Birmingham
2 Bioinformatics: Definitions
- Fusion of Biology, Computer science, Mathematics
- Broad Meaning
- Any computationally intensive research with biological relevance
- Narrow meaning
- Computer-based analysis and archiving of macromolecular sequence data
3 The Scope of Bioinformatics
Algorithm Development
- Domain Hunting
- Detailed Analysis by Expert
- Interface Design
- Graphics
- Web
Large-scale (semi-) Automated Analysis
4 Scale
- Large-scale
- (semi)-automated analysis of genome sequence
- Interface with functional genomics
- Medium-scale
- Domain hunting
- Small-scale
- Analysis of individual sequence by power-user
6 Bioinformatics: Enabling Technologies
- Maths
- Algorithm Development
- Computer science
- Ever-increasing list of free software
- Bioinformatics programs
- Operating system: Linux
- Scripting languages glue programs together
- Perl
- Growth of Internet
- Distance is dead! Distributed Resources
- User-friendliness of Web
- Just Cut and Paste!
7 Bioinformatics: Challenges
- DNA and Protein Sequences
- Exponential increase
- Genome Sequencing
- Need for annotation
- Molecular stamp collecting?
- Role in drug discovery
- Big business
8 Bioinformatics: Challenges
- Interplay between the wet and the dry
- Bioinformatics Predictions
- range from the very general to the very specific
- range from highly speculative to the almost certain
- But in the end they are still only predictions
- Need for experimental confirmation
9 Challenges: The Data Flood
- 52 bacterial genomes completed and published
- 100,000 genes
- 228 genomes ongoing
- 450,000 more genes when finished
10 The Post-Genomic Iceberg
Discovered Biology
Undiscovered Biology
- The Undiscovered Genotype
- Most genes are of unknown function
- Undiscovered genomic diversity
- The Undiscovered Phenotype
- Most bacterial physiology inapparent in the lab
- Undiscovered regulators and regulons
11 Bioinformatics Approaches
- Multiple levels of analysis
- Gene Finding
- Protein function prediction
- Power of homology
- Pitfalls of homology
- Comparative genomics
- Metabolism reconstruction
- Interface with functional genomics
12 What is a Sequence?
- DNA Sequence, double stranded, antiparallel
- Conventionally written 5' to 3'
- 5'-ATGAGTACCG CTAAATTAGT TAAATCAAAA-3'
- 3'-TACTCATGGC GATTTAATCA ATTTAGTTTT-5'
- RNA sequence, single stranded, U instead of T
- 5'-AUGAGUACCG CUAAAUUAGU UAAAUCAAAA-3'
- Protein sequence
- conventionally written N-terminal to C-terminal
- 3-letter code Met Ser Thr Ala Lys Leu
- 1-letter code MSTAKLVKSKATN
- Sequences usually written in a monospaced font like Courier
- The same two sequences set in Times (left) and Courier (right):
- AGCGGGCGG AGCGGGCGG
- ATCGTTCTG ATCGTTCTG
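The strand conventions on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names are illustrative and it assumes upper-case sequences containing only A, C, G and T.

```python
# Complement lookup for unambiguous DNA bases
# (assumption: upper-case A, C, G, T only, no ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Return the RNA with the same sequence as this DNA strand (U instead of T)."""
    return dna.replace("T", "U")


# Top strand from the slide, written 5' to 3'.
top = "ATGAGTACCGCTAAATTAGTTAAATCAAAA"

# Reversing the reverse complement gives the bottom strand as drawn
# on the slide, i.e. read 3' to 5' underneath the top strand.
bottom_as_drawn = reverse_complement(top)[::-1]

print(bottom_as_drawn)
print(transcribe(top))
```

Note that the slide draws the second strand 3' to 5' so that it pairs with the first; databases store only one strand and software derives the other.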
14 First get your sequence!
- Most sequencing is now...
- Performed on DNA (rather than RNA or protein)
- Performed using the Sanger dideoxy method
- Exceptions...
- rRNA is sometimes sequenced directly
- N-terminal and mass spectrometry sequencing of proteins
- Template for sequencing can be
- DNA cloned in a plasmid (e.g. pUC19)
- DNA cloned in a single-stranded phage (e.g. M13)
- PCR products
15 First get your sequence!
- Automated Sequencing
- Fluorescent dyes used
- Extract sequence from chromatogram
- Must extract only the error-free region
16 Where to analyse?
17 Assembling a contig
18 Analysis of nucleotide sequence data
- Sequence Composition
- bacteria vary greatly in chromosomal GC content; each genus has a characteristic GC content
- useful in
- checking you cloned what you think you have
- identifying foreign DNA in a genome
- Predicted Restriction maps
- helpful in the lab
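As an illustration of the sequence composition check, GC content can be computed over a whole sequence or over sliding windows; a window whose GC content differs sharply from the rest of the genome is one hint of foreign DNA. The sketch below is an assumption-laden example, not the method used in the lecture: the function names, window size and step are illustrative.

```python
from typing import List, Tuple


def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C (0.0 to 1.0)."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return (dna.count("G") + dna.count("C")) / len(dna)


def windowed_gc(dna: str, window: int, step: int) -> List[Tuple[int, float]]:
    """GC content of each full window, as (start position, fraction) pairs."""
    return [
        (start, gc_content(dna[start:start + window]))
        for start in range(0, len(dna) - window + 1, step)
    ]


# Toy example: an AT-rich stretch followed by a GC-rich stretch.
print(windowed_gc("AAAAGGGG", window=4, step=4))
```

In practice the window would span hundreds or thousands of bases, and the threshold for calling a region atypical would have to be chosen for the genome in question.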
19 Analysis of nucleotide sequence data
- Search for Other Sequence Features
- Promoters
- Ribosome-binding Sites
- Repeats
- Inverted Repeats (e.g. terminators)
- Consensus Sequences for regulator binding sites
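One of these features, the inverted repeat, is simple enough to sketch: a stretch of sequence followed, after a short loop, by its own reverse complement, as in the stem-loop of a terminator. The code below is a naive illustration that only finds perfect stems; the function name and the default stem and loop lengths are assumptions, and real terminator finders also score mismatches and the downstream run of Ts.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(
    dna: str, stem: int = 6, min_loop: int = 3, max_loop: int = 8
) -> List[Tuple[int, int, str]]:
    """Return (start, loop length, stem sequence) for every perfect stem-loop."""
    hits = []
    for start in range(len(dna) - 2 * stem - min_loop + 1):
        left = dna[start:start + stem]
        wanted = reverse_complement(left)  # what the right arm must look like
        for loop in range(min_loop, max_loop + 1):
            right_start = start + stem + loop
            if dna[right_start:right_start + stem] == wanted:
                hits.append((start, loop, left))
    return hits


# Toy sequence: stem GCCGCC, loop AAAA, then the reverse complement GGCGGC.
toy = "TT" + "GCCGCC" + "AAAA" + "GGCGGC" + "TT"
print(find_inverted_repeats(toy))
```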
20 Searching for coding regions
- Any given DNA sequence can be translated in 6 different reading frames, 3 on each strand
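A six-frame translation can be sketched as follows. The example assumes the standard genetic code and upper-case unambiguous DNA; the function names and the frame numbering (1 to 3 for the given strand, -1 to -3 for the reverse complement) are illustrative choices, not taken from the slides.

```python
from itertools import product
from typing import Dict

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate codon by codon from the first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> Dict[int, str]:
    """Translate all six reading frames, three on each strand."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translation("ATGAGTACCGCTAAATTAGTTAAATCAAAA").items()):
    print(frame, protein)
```

A long stretch without '*' in one frame is an open reading frame, which is what an ORF map plots.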
21 ORF maps
22 The Problem of Frameshift Errors
Actual sequence
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
Frame 2: V P L N L N Q K R P I C F I P A T M S P T A R K
Frame 3: E Y R I S I K S D Q S A L Y P Q R C L R Q R E K
Frameshifted sequence after single base error
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
Frame 2: V P L N L N Q K A T N L L Y T R N D V S D S E K
Frame 3: E Y R I S I K K R P I C F I P A T M S P T A R K
Markov Models (GLIMMER) now commonly used to predict coding regions
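The effect of the single-base error can be reproduced directly. The sketch below translates the first reading frame of the slide's sequence before and after inserting one extra A into the run of As at position 30; it assumes the standard genetic code, and the helper name `translate` is illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first base, ignoring any incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


actual = (
    "ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTAT"
    "ACCCGCAACGATGTCTCCGACAGCGAGAAA"
)
# Simulated sequencing error: one extra A after base 30.
shifted = actual[:30] + "A" + actual[30:]

print(translate(actual))
print(translate(shifted))
```

The two translations agree for the first ten residues and then diverge completely, which is why a frameshift can make a real gene look like two unrelated short ORFs.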
23 Analysis of Protein Sequence Data
24 Analysis of Protein Sequence Data: Signal peptides
SignalP uses neural networks
25 Homology
- Similarity that arises because of descent from a common ancestor
- "The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel. We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation. Languages, like organic beings, can be classed in groups under groups; and they can be classed either naturally according to descent, or artificially by other characters. The survival or preservation of certain favoured words in the struggle for existence is natural selection."
- Charles Darwin, 1871, The Descent of Man, Chapter 3
26 Homology
the cat sat on the mat
die Katze sass auf der Matte
vgeGBant88-2      ITLITCVSVKDNSKRYVVAG
vgeGEfae9-1 78    LTLITCDQATKTTGRIIVIA
vgeGSpne1-403     MTLITCDPIPTFNKRLLVNF
sortase_staur     LTLITCDDYNEKTGVWEKRK
27 Homology
Sequence homology is not just sequence similarity!
Sequences 1, 1A, 1B and 2 are all homologous to one another.
Another sequence 2 is similar to sequences 1, 1A, 1B and 2, but not homologous to them, as it does not share a common ancestor with them.
Another sequence 1 is neither homologous nor similar to sequences 1, 1A, 1B and 2.
28 Types of Homology
29 Sequence Databases
- All sequences when published are deposited in Sequence Databases
- Nucleic Acid Sequence Databases
- EMBL, Heidelberg, http://www.embl-heidelberg.de/
- GenBank, in the NCBI, USA, http://www.ncbi.nlm.nih.gov/
- Protein Sequence Databases
- GenPept and TREMBL
- Curated database SwissProt, Geneva, http://www.expasy.ch/sprot/
- Numerous others, reviewed every year in NAR
- Problem of sequence formats
- Simplest format is FASTA
- >sequence name
- AATGATGCGTGATGATGATGATGACTGACTGATGATGAT
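Because FASTA is so simple (a header line beginning with ">" followed by one or more lines of sequence), a reader for it takes only a few lines. The sketch below is a minimal illustration with a hypothetical function name; it keeps whole header lines as record names and does not handle duplicate names or comment lines.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into a {name: sequence} dictionary."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header: everything after '>'
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            parts[name].append(line)  # sequence may be wrapped over many lines
    return {record: "".join(chunks) for record, chunks in parts.items()}


example = [
    ">sequence name",
    "AATGATGCGTGATGATGATG",
    "ATGACTGACTGATGATGAT",
]
print(parse_fasta(example))
```

To read a file, the same function can be passed an open file handle, since iterating over a text file yields its lines.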
30 Homology Searches
- The aim of homology searches is to identify sequences within these databases that are homologous to your sequence.
- This involves comparing your sequence with all the database sequences, looking for stretches of sequence that appear to be similar, then scoring the matches and ranking them. Usually a measure of the significance of the match will be given.
31 Homology Searches: Translate first!
32 Homology Searches with BLAST
- BLASTN
- nucleotide query vs nucleotide database
- BLASTP
- protein query vs protein database
- BLASTX
- automatic 6-frame translation of nucleotide query vs protein database
- TBLASTN
- protein query vs automatic 6-frame translation of nucleotide database
- TBLASTX
- automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of nucleotide database
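The choice between the five programs depends only on the type of the query, the type of the database, and whether a nucleotide-versus-nucleotide search should be done at the protein level. The lookup below restates the list above as code; the function and argument names are hypothetical and it does not call BLAST itself.

```python
# (query type, database type) -> program, as listed on the slide.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast_program(query: str, database: str, translate_both: bool = False) -> str:
    """Pick the BLAST flavour for a given query and database type.

    translate_both only matters for nucleotide vs nucleotide searches,
    where it selects the 6-frame vs 6-frame comparison (TBLASTX).
    """
    key = (query, database)
    if key not in PROGRAMS:
        raise ValueError(f"unknown sequence types: {key}")
    if key == ("nucleotide", "nucleotide") and translate_both:
        return "TBLASTX"
    return PROGRAMS[key]


print(choose_blast_program("nucleotide", "protein"))
```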
33 Typical Blast Output
                                                          Reading   High  Sum Probability
Sequences producing High-scoring Segment Pairs              Frame  Score  P(N)      N
emb|X69337|ECDPS E.coli dps gene for binding protein            2    834  6.4e-109  1
gb|U04242|ECU04242 Escherichia coli core starvation p...        3    828  2.7e-106  1
emb|X14180|ECGLNHPQ Escherichia coli glutamine permeas...       3    443  2.8e-53   1
gb|U18769|HDU18769 Haemophilus ducreyi fine tangled p...        1    150  4.0e-18   2
dbj|D01016|ANALTI46 Anabaena variabilis lti46 gene. gte...      2    129  4.8e-12   2
gb|M84990|P26BPO Plasmid pOP2621 ORF1 gene, 5' end...          -2    131  6.7e-09   1
gb|U16121|HPU16121 Helicobacter pylori neutrophil act...        1    112  1.8e-06   1
gb|M32401|TRPTYF1 T.pallidum pallidum antigen TyF1 g...         3    101  5.6e-06   2
emb|X71436|RPNTRB R.phaseoli ntrB gene                          1     67  0.76      2
gb|L35598|DRODGC1A Drosophila melanogaster receptor g...        1     48  0.97      3
34 Typical Blast Output
gb|U18769|HDU18769 Haemophilus ducreyi fine tangled pili major pilin subunit gene
Length = 780
Plus Strand HSPs:
Score = 150 (68.0 bits), Expect = 4.0e-18, Sum P(2) = 4.0e-18
Identities = 36/89 (40%), Positives = 46/89 (51%), Frame = 1
Query:  30 ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV 89
           E L         L LI K AHWN  G  FIAVHEMLD       D  D  AER   LG
Sbjct: 253 EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA 432
Query:  90 ALGTTQVINSKTPLKSYPLDIHNVQDHLK 118
             G             YPL     QDHLK
Sbjct: 433 PNGLSGNLVETRQSPEYPLGRATAQDHLK 519
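The "Identities" figure in this output can be recomputed from the two aligned segments. The sketch below counts matching positions across both high-scoring segment pairs; the function name is illustrative, and the "Positives" figure is not reproduced because it depends on the substitution matrix used.

```python
from typing import Tuple


def count_identities(query: str, subject: str) -> Tuple[int, int]:
    """Return (identical positions, aligned length) for two aligned segments."""
    if len(query) != len(subject):
        raise ValueError("aligned segments must have the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return matches, len(query)


# The two HSPs from the slide (query residues 30-89 and 90-118).
segments = [
    (
        "ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV",
        "EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA",
    ),
    (
        "ALGTTQVINSKTPLKSYPLDIHNVQDHLK",
        "PNGLSGNLVETRQSPEYPLGRATAQDHLK",
    ),
]

identical = sum(count_identities(q, s)[0] for q, s in segments)
aligned = sum(count_identities(q, s)[1] for q, s in segments)
print(f"Identities = {identical}/{aligned} ({100 * identical // aligned}%)")
```

The subject coordinates (253 to 519) are nucleotide positions, three per aligned residue, because the hit comes from a translated nucleotide database entry.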
35 Sequence alignments
Dps_trepo  .........N MCTDGKKYHS TATSAAVGAS APGVPDARAI AAICEQLRRH
Dps_helico .......... .......... .......... .......... MKTFEILKHL
Dps_anab   .......... .......... .......MPR INIGLTDEQR QGVINLLNQD
MrgA       .......... .......... .......... MKTENAKTNQ TLVENSLNTQ
Dps_haemo  MRSKTITFPV LKLTGQSQAL TNDMHKNADH TVPGLTVATG HLIAEALQMR
Dps_ecoli  .......... ....MSTAKL VKSKATNLLY TRNDVSDSEK KATVELLNRQ
Dps_strep  ........MT SQPHLHQHAA EIQEFGTVTQ LPIALSHDAR QYSCQRLNRV

Dps_trepo  VADLGVLYIK LHNYHWHIYG IEFKQVHELL EEYYVSVTEA FDTIAERLLQ
Dps_helico QADAIVLFMK VHNFHWNVKG TDFFNVHKAT EEIYEEFADM FDDLAERLVQ
Dps_anab   LADSYLLLVK TKKYHWDVVG PQFRSLHQLW EEHYEKLTEN IDAIAERVRT
MrgA       LSNWFLLYSK LHRFHWYVKG PHFFTLHEKF EELYD..... .HAAETWIPS
Dps_haemo  LQGLNELALI LKHAHWNVVG PQFIAVHEML DSQVDEVRDF IDEIAERMAT
Dps_ecoli  VIQFIDLSLI TKQAHWNMRG ANFIAVHEML DGFRTALIDH LDTMAERAVQ
Dps_strep  LADTQFLYAL YKKCHWGMRG PTAYQLHLLF DKHAQEQLEL VDALAERVQT
- ClustalW most commonly used program
- Note problems of indels and ragged ends
- Need for manual refinement
- Multiple alignments useful for identifying active sites and distant homology
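One simple way to use a multiple alignment is to list the columns where every sequence carries the same residue, since invariant positions are candidates for functionally important sites. The sketch below does this for the second block of the alignment on this slide; the function name is illustrative, '.' is treated as a gap, and strict identity is a deliberately crude conservation measure.

```python
from typing import List, Tuple

# Second alignment block from the slide; spaces are only for readability.
BLOCK = {
    "Dps_trepo":  "VADLGVLYIK LHNYHWHIYG IEFKQVHELL EEYYVSVTEA FDTIAERLLQ",
    "Dps_helico": "QADAIVLFMK VHNFHWNVKG TDFFNVHKAT EEIYEEFADM FDDLAERLVQ",
    "Dps_anab":   "LADSYLLLVK TKKYHWDVVG PQFRSLHQLW EEHYEKLTEN IDAIAERVRT",
    "MrgA":       "LSNWFLLYSK LHRFHWYVKG PHFFTLHEKF EELYD..... .HAAETWIPS",
    "Dps_haemo":  "LQGLNELALI LKHAHWNVVG PQFIAVHEML DSQVDEVRDF IDEIAERMAT",
    "Dps_ecoli":  "VIQFIDLSLI TKQAHWNMRG ANFIAVHEML DGFRTALIDH LDTMAERAVQ",
    "Dps_strep":  "LADTQFLYAL YKKCHWGMRG PTAYQLHLLF DKHAQEQLEL VDALAERVQT",
}


def conserved_columns(rows: List[str], gap: str = ".") -> List[Tuple[int, str]]:
    """Return (0-based column, residue) for columns identical in every row."""
    conserved = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            conserved.append((index, column[0]))
    return conserved


rows = [sequence.replace(" ", "") for sequence in BLOCK.values()]
print(conserved_columns(rows))
```

Only a handful of columns are invariant across all seven sequences, which illustrates how little strict identity remains between distant homologues even where the alignment is clearly meaningful.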
36 Into the twilight zone: The search for distant homologies
- Proteins consist of domains
- Transitivity of Homology
- Distant Homology
[Diagram: proteins A, B, C and D, with Signal Peptide and Coiled coil domain regions labelled]
37 Domain Hunting
Homology Search
Add to Alignment
38 PSI-BLAST: Position-Specific Iterated BLAST
- combines statistically significant alignments produced by BLAST into a position-specific score matrix
- searches the database using this matrix
- allows multiple iterations of this process
- runs at approximately the same speed per iteration as gapped BLAST
- is much more sensitive to weak but biologically relevant sequence similarities
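The idea of a position-specific score matrix can be illustrated with a toy version. The sketch below builds log-odds scores from a small ungapped alignment and slides the matrix along a sequence; it is not PSI-BLAST's actual procedure. The uniform background frequencies, the flat pseudocount and the function names are all simplifying assumptions (PSI-BLAST uses sequence weighting, data-dependent pseudocounts and observed background frequencies).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log2-odds score for each residue at each alignment column.

    Assumes a uniform background (1/20 per residue), a simplification.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm: List[Dict[str, float]], segment: str) -> float:
    """Sum of the position-specific scores for a segment of matching length."""
    return sum(column[res] for column, res in zip(pssm, segment))


def best_hit(pssm: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Ten conserved columns around the H-W motif of the Dps alignment.
motif_alignment = [
    "LHNYHWHIYG", "VHNFHWNVKG", "TKKYHWDVVG", "LHRFHWYVKG",
    "LKHAHWNVVG", "TKQAHWNMRG", "YKKCHWGMRG",
]
pssm = build_pssm(motif_alignment)
print(best_hit(pssm, "AAAAA" + "LKHAHWNVVG" + "AAAAA"))
```

Invariant columns (H, W, G) receive the highest scores, so a database sequence that keeps those residues can score well even when the rest of the region has diverged; that is the source of the extra sensitivity.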
39 Role of Sequence Analysis in the Pre-Genomic Era
Confirm
- Sequence Analysis
- Homology
- Structural Features
Identify Clone Gene
Obtain Sequence
40 Role of Sequence Analysis in the Post-Genomic Era
Obtain Sequence from Genome Project
Formulate Hypothesis
- Sequence Analysis
- Homology
- Structural Features
- Genomic Context
Novel Lab-based Experimental Programme: Amplify, Clone, Express Gene, Create mutant etc.