Sequence Analysis Methods for comparative Genomics

About This Presentation

Title:

Sequence Analysis Methods for comparative Genomics

Description:

Identity-difference (default 1, 3), affine gap penalties. Blosum or Pam substitution matrix, affine gap penalties. Scoring system: ... – PowerPoint PPT presentation

Number of Views:36

Avg rating:3.0/5.0

Slides: 30

Provided by: isrecI

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Analysis Methods for comparative Genomics

1
Sequence Analysis Methods for comparative Genomics
Philipp Bucher

Comparative Genomics
SIB PhD program
Lausanne Feb 12-16 2007

2
Topics covered

Alignment methods for genomic DNA sequences
Comparative gene prediction
Comparative motif finding

3
Genome sequence comparison - Challenges

Very large sequences up to several 100 million
bp per chromosome
Only a small percentage of the genomes may be
alignable.
Chromosomal rearrangements
Insertions
Translocations
Duplications
Repetitive elements
simple repeats
lineage specific transposons
ancient transposons
Processed pseudogenes

4
Nothing in biology makes sense except in the
light of evolutionTh. Dobzhansky

In fact, many contemporary biologists dont care
about evolution but are more interested in
molecular function or drug discovery.
To those, comparative genomics is an extremely
powerful genome annotation technique.
Genome sequence comparison may helpful in
the assembly of a new genomes
predicting gene structures
the identification of non-coding regulatory
elements
In the recognition of orthologs
to discarding sequence regions not of interest in
a particular context
for structural and functional inferences from
substitution patterns

5
Sequence comparison tools overview

6
BLAST heuristics for protein sequence similarity
searches
1. Query compilation For each position of the
query sequence, compile a list of 3-letter words
which match the query with score threshold T
(score depends on substitution matrix). 2. Word
search Search for two non-overlapping word
matches on the same diagonal of the path matrix
within a distance A. 3. Un-gapped match
extension Extend the un-gapped alignment along
the diagonal in both directions unless the score
drops more than X below the maximal score yet
attained. 4. Gapped alignment extension For
un-gapped alignments (segment pairs) exceeding a
threshold score Sg , trigger a regional gapped
alignment until the score drops more than Xg
below the maximal score yet attained. 5.
Statistical evaluation of gapped alignments
Blast uses hard-wired extreme value distribution
parameters ? and K for specific scoring systems
(substitution matrix gap penalties)
7
Examples of word match scores with Blosum62 matrix
WWW -gt 33 WWW YYY -gt 21 YYY GGG -gt
18 GGG QQQ -gt 15 QQQ SSS -gt 13 SSS HSW -gt
20 HAW LFY -gt 12 IYY EWD -gt 15 DWE KFI
-gt 8 RYV EWC -gt -11 FDE
Matrix made by matblas from blosum62.iij
column uses minimum score BLOSUM Clustered
Scoring Matrix in 1/2 Bit Units Blocks
Database /data/blocks_5.0/blocks.dat Cluster
Percentage gt 62 Entropy 0.6979, Expected
-0.5209 A R N D C Q E G H I L K
M F P S T W Y V B Z X A 4 -1 -2 -2
0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2
-1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1
-3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3
0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3
0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3
-3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9
-3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3
-3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0
-3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4
2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1
4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3
-3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3
0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0
0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1
0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1
-2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4
-3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1
-3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1
0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3
-1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0
6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3
-1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2
-1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1
-2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1
-1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1
-1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1
1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2
-1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3
-2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1
-1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3
0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4
1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1
-3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2
-1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1
-1 -1 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
8
Blast Principle of two-hit heuristic
9
Blast Gap extension.
a Area considered by gap extension algorithm.
Computation starts at the center of the highest
scoring segment pair obtained after word-hit
extension. b Path of optimal local alignment
found. c Optimal local alingment.
10
Why is blastn less sensitive than blastp ?

11
Blastz overview

Searches for multiple gapped local alignments
ordering constraints may apply
Same 3-step procedure as blast
word hit, ungapped extension, gapped extension
Word hits
exact (default 8) or patterned, e.g.
1110100110010101111 (1 indicates relevant
positions).
optionally transition-tolerant (AG, CT count as
matches)
two hits strategy like blastp
Species-pair specific base substitution matrix
Alignment scores dynamically adjusted for local
complexity
Between adjacent alignments search repeated with
more sensitive parameters
Output
Coordinates of local alignments and scores
For each alignment, coordinates and scores of
HSPs (high-scoring segment pairs)

12
Comparative gene prediction

Categories of gene prediction programs
Single-genome ab initio (classical gene
prediction problem)
Dual-genome ab initio (comparative gene
prediction)
EST-, mRNA, and protein-based methods (same or
related species)
Anything goes (e.g. ENSEMBL gene annotation
pipeline)
Why should comparative gene prediction be more
effective
Consensus of two predictions more reliable than
one
Coding regions more conserved than non-coding
regions (??)
Coding regions have different substitution and
indel patterns (frame conservation)
Splice junctions need to be conserved as well
Some limitations
Genomes need to be reasonably diverse
Gene structures may not be conserved

13
Gene prediction statement of the problem

Given a genome sequence
predict the structure of all transcripts
Predict the structure of the coding part of all
transcripts.
Performance criteria
of correctly predicted/missed genes
of correctly predicted/missed exons
of correctly predicted/missed coding
nucleotides
Further complications
It is not known in advance how many genes a
sequence contains
The sequence may start or end in the middle of a
gene
Alternative splicing

14
GENSCAN, and example of an ab initio gene finding
algorithm

Principle GENSCAN finds the optimal parse of a
sequence
A parse is a succession of
intergenic regions
5UTR (untranslated regions)
Exons
Introns
3UTRs
Evaluation of alternative parses with the aid of
Weight matrices or similar models for sites
promoters, translation starts, splice donors and
acceptors, translation stops, polyadenylation
sites.
Interpolated Markov chains (3-periodic HMMs), and
length distributions for exons, introns, 5 and
3 UTRs, and intergenic regions.

15
GENESCAN model (variant HMM)
E0 exon plus starting in phase 0 I0 exon
on starting in phase 1
All named elements except promoters are scored by
Markov chains or interpolated Markov
chains. Transitions and promoters are scored by
weight matrices or similar descriptors
16
Gene Prediction The Intron Phase Problem
ATGGGTCCAgttggtgcccttagGTGTTCGAgtgagccacagCACCTGGA
AG MetGlyPro..............ValTyrAs...........pThrT
rpLys
EInit
I0
E0
I2
E2
Introns can be inserted at three different codon
positions Introns inserted between codons have
phase 0 Introns inserted after the first codon
base have phase 1 Introns inserted after the
second codon position have phase 2 Exons starting
between codons have phase 0 Exons starting after
the first codon base have phase 1 Exons starting
after the second codon base have phase
2 Important rule An intron of a given phase
must be followed by an exon of the same phase.
Otherwise, the protein will not be translated in
the correct frame.
17
Scoring target Scoring method F 5UTR simpl
e Markov chain F ? Einit Kozak
consensus weight matrix Einit First
exon interpolated Markov chain Einit ? I1 Splice
donor weigh matrix I1 Intron phase
1 interpolated Markov chain I1 ? E1 Splice
acceptor weigh matrix - like
18
Example of a dual gene prediction program SLAM

Input
two sequences
Approximate alignments generated with Avid or
Mummer
Coupled gene prediction and alignment
optimization
Based on a generalized pair HMM (GPHMM)
gene model analogous to GENSCAN
arbitrary length distributions (two sets of
durations instead of one)
exon model based on codon pairs in different
alignment configurations (new!)
Splice junctions modeled by VLMMs (non-stationary
variable-length Markov models)
Models individually trained for different species
Other examples of dual gene prediction programs
SGP2
TWINSCAN

19
Evaluation of gene predictions(from Guigo et al.
2006, Genome Biol. 7S2)
20
Evaluation at gene level
Gene transcript evaluation. Computing sensitivity
and specificity at transcript level (a) complete
transcript annotation (b) incomplete transcript
annotation. Transcripts marked with an asterisk
are considered 'consistent with the annotation'
and will be scored as correct.
21
Performance Evaluation of gene prediction
programs based on experimentally characterized
EGASP regions
22
Computational Approaches to Gene Regulatory
elements
Commonly addressed problems and corresponding
bottlenecks

Finding a common sequence motif in a set of
sequences known to contain a binding site to the
same transcription factor or to confer the same
regulatory property to an adjacent gene.
Bottleneck appropriate data sets, computer
algorithms. New perspective ChIP-chip data.
Identification of sequence motifs that are
over-represented at a particular distance from
transcription initiation sites. Bottleneck large
sets of experimentally mapped promoters. Hope
comes from mass genome annotation (MGA) data
(CAGE, etc.).
Identification of transcription factor binding
sites and other sequence elements in DNA
regulatory sequences. Bottleneck accurate and
reliable motif descriptions.
Development of promoter prediction algorithms.
Current bottleneck large sets of experimentally
mapped promoters. Perspective MGA data.
Identification and interpretation of conserved
non-coding sequence regions between orthologous
genes of related organisms. Current bottleneck
concepts and models of gene regulatory regions.

Comparative genomics may help solving these
problems
23
Transcription Factor Binding Sites Features and
Facts

Degenerate sequence motifs
Typical length 6-20 bp
Low information content 8-12 bits (1 site per
250-4000 bp)
Quantitative recognition mechanism measurable
affinity of different sites may vary over three
orders of magnitude
Regulatory function often depends on cooperative
interactions with neighboring sites

24
Limitations of Motif Discovery Algorithms

A recent paper show that computational motif
discovery is disappointingly ineffective

25
(No Transcript)
26
Bad Performance of Motif Discovery algorithms on
Eukaryotic Benchmark Data Sets (Results from
Tompa et. al. 2005)
27
Similar results are obtained with prokaryotic
benchmark datasets
28
Comparative Motif Discovery Problem statement
and strategies
Problem 1 Motif discovery Input several sets of
aligned or unaligned sequences. Output a set of
over-, under-, or class-associated motifs Motifs
may be represented as consensus sequences or
weight matrices Case study Xie et al. (2005)
mammalian promoters and 3UTRs Program example
PhyloGibbs (Siddhartan et al. 2005)
http//www.phylogibbs.unibas.ch/cgi-bin/phylogibb
s.pl Problem 2 Motif search Input one set of
aligned or unaligned sequences, a library of
motif definitions (e.g. TRANSAC matrices) Output
conserved binding site locations Program example
ConSite (Sandelin et al. 2004), CisOrtho (Bigelow
et al. 2004)
29
New human promoter motifs discovered by
comparative approach
Data from Xie et al. 2005, Nature 434, 338-345

Write a Comment

User Comments (0)

About PowerShow.com

Sequence Analysis Methods for comparative Genomics - PowerPoint PPT Presentation

Sequence Analysis Methods for comparative Genomics

Identity-difference (default 1, 3), affine gap penalties. Blosum or Pam substitution matrix, affine gap penalties. Scoring system: ... – PowerPoint PPT presentation