Sequence Alignment Techniques

Title: PowerPoint Presentation Author: T. Viswanath Last modified by: T. V. Prasad Created Date: 3/15/2003 8:00:39 AM Document presentation format – PowerPoint PPT presentation

Slides: 37

Provided by: TVisw8

1

Sequence Alignment Techniques

2
In this presentation

Part 1 Searching for Sequence Similarity
Part 2 Multiple Sequence Alignment

3
Part 1

Searching for Sequence Similarity

4
Sequence similarity searches

Sequence similarity searches of databases enable
us to extract sequences that are similar to a
query sequence
Information about these extracted sequences can
be used to predict the structure or function of
the query sequence
Prediction using similarity is a powerful and
ubiquitous idea in bioinformatics. The
underlying reason for this is molecular evolution

5
Sequence alignment

Any pair of DNA sequences will show some degree of
similarity
Sequence alignment is the first step in
quantifying this in order to distinguish between
chance similarity and real biological
relationships
Alignments show the differences between sequences
as changes (mutations), insertions or deletions
(indels or gaps), and can be interpreted in
evolutionary terms

6
Alignment algorithms

Dynamic programming algorithms can calculate the
best alignment of two sequences
Well-known variants are
the Smith-Waterman algorithm (local alignments)
the Needleman-Wunsch algorithm (global
alignments)
Local alignments are useful when sequences are
not related over their full lengths, e.g.,
proteins sharing only certain domains or DNA
sequences related only in exons
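The difference between global and local alignment can be sketched in a few lines of dynamic programming. This is a minimal illustration only: the function name `align_score`, the match/mismatch values and the linear gap penalty are assumptions chosen for the example, not parameters given in the presentation. It returns the best score, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False follows the Needleman-Wunsch (global) recurrence,
    local=True follows the Smith-Waterman (local) recurrence.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local alignment: a path may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(align_score("ACGT", "AGT"))                      # global score
print(align_score("GGGACGT", "ACGTCCC", local=True))   # local score
```

With `local=True` the shared ACGT segment is scored on its own, which is why local alignment suits sequences that are related over only part of their length.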

7
Alignment scores and gap penalties

A simple alignment score measures the number or
proportion of identically matching residues
Gap penalties are subtracted from such scores to
ensure that alignment algorithms produce
biologically sensible alignments without many
gaps
Gap penalties may be constant (independent of the
length of the gap), proportional (proportional to
the length of the gap) or affine (containing gap
opening and gap extension contributions)
Gap penalties can be varied according to the
desired application
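The three gap penalty schemes can be written as one small function. This is a sketch under assumptions: the name `gap_penalty`, the default values, and the affine convention used here (opening cost for the first gap position, extension cost for each further position) are illustrative, and other conventions exist.

```python
def gap_penalty(length, scheme="affine", opening=10.0, extension=1.0):
    """Penalty for a gap of the given length under one of three schemes.

    The default values are illustrative assumptions, not recommended settings.
    """
    if length <= 0:
        return 0.0
    if scheme == "constant":
        # Independent of the length of the gap.
        return opening
    if scheme == "proportional":
        # Grows linearly with the length of the gap.
        return extension * length
    if scheme == "affine":
        # One opening contribution plus an extension contribution per extra position.
        return opening + extension * (length - 1)
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("constant", "proportional", "affine"):
    print(scheme, [gap_penalty(n, scheme) for n in (1, 2, 5)])
```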

8
Similarity and homology

Similarity may exist between any sequences
Sequences are homologous only if they have
evolved from a common ancestor
Homologous sequences often have similar
biological functions (orthologs), but the
mechanism of gene duplication allows homologous
sequences to evolve different functions (paralogs)

9
Similarity search in databases

Sequences similar to a query can be found in a
database by aligning it to each database sequence
in turn and returning the highest scoring (most
similar) sequences
This can be achieved by dynamic programming
algorithms but in practice faster approximate
methods are often used

10
Statistical scores

The p value of a similarity score is the
probability of obtaining a score at least as high
in a chance similarity between two unrelated
sequences of similar composition
Low p values indicate matches that
are likely to have real biological significance
The related E value is the expected frequency of
chance occurrences scoring at least as high as
the identified similarity
A low p value for a similarity between two
sequences can translate into a high E value for a
search of a large database
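The relationship between the p value of one comparison and the E value of a whole search can be illustrated with a little arithmetic. The sketch assumes that each database sequence is an independent comparison with the same p value, which is a simplification; the function names are hypothetical.

```python
import math


def expected_chance_hits(p_value, database_size):
    """Approximate E value: expected number of chance hits scoring at least as high.

    Assumes database_size independent comparisons, each with the same p value.
    """
    return p_value * database_size


def p_at_least_one(e_value):
    """Probability of at least one chance hit, assuming hits are Poisson distributed."""
    return 1.0 - math.exp(-e_value)


# A p value of 1 in 10,000 looks convincing for a single pair of sequences,
# but in a database of one million sequences it is expected about 100 times by chance.
e = expected_chance_hits(1e-4, 1_000_000)
print(e, p_at_least_one(e))
```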

11
Sensitivity and specificity

These measures quantify the success of a database
search strategy
Sensitivity measures the proportion of real
biological sequence relationships in the database
that were detected as hits in the search
Specificity is the proportion of the hits
corresponding to real biological relationships
Changing E and p value thresholds results in a
trade-off between these complementary measures of
success
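Using the definitions on this slide, both measures can be computed from two sets: the hits returned by a search and the sequences known to be true relatives. The function name and the example identifiers are hypothetical.

```python
def search_quality(hits, true_relatives):
    """Return (sensitivity, specificity) as defined in the slide above.

    sensitivity = true hits / all real relationships in the database
    specificity = true hits / all hits returned by the search
    """
    hits, true_relatives = set(hits), set(true_relatives)
    true_hits = hits & true_relatives
    sensitivity = len(true_hits) / len(true_relatives) if true_relatives else 0.0
    specificity = len(true_hits) / len(hits) if hits else 0.0
    return sensitivity, specificity


# Hypothetical search: 4 hits returned, 3 of them real; 5 real relatives exist.
print(search_quality({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"}))
```

Loosening the E value threshold adds hits, which tends to raise sensitivity and lower specificity; tightening it does the opposite.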

12
Maximizing amino acid identities

Protein sequences can be aligned to maximize
amino acid identities, but this will not reveal
distant evolutionary relationships

13
Evolution

Protein-coding sequences evolve slowly compared
with most other parts of the genome, because of
the need to maintain protein structure and
function
An exception to this is the fast evolution that
might occur in the redundant copy of a recently
duplicated gene

14
Allowed changes

Changes in protein sequences during evolution
tend to involve substitutions between amino acids
with similar properties because these tend to
maintain the structural stability of the protein

15
Substitution score matrices

These matrices give scores for all possible amino
acid substitutions during evolution
Higher scores indicate more likely substitutions
Example matrices are BLOSUM62 and PAM250
PAM stands for point accepted mutation, and in
this case, the evolutionary distance of the
matrix is 250 accepted amino acid changes per 100 residues
Dynamic programming algorithms for sequence
alignment can operate using scores from these
matrices
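To show how an alignment is scored with a substitution matrix instead of by counting identities, the sketch below scores an ungapped alignment with a tiny lookup table. The table `TOY_SCORES` is an illustrative subset of PAM250-style values, not a complete matrix, and the default score for pairs missing from it is an assumption.

```python
# Illustrative subset of PAM250-style scores (assumption: not a full matrix).
TOY_SCORES = {
    ("K", "K"): 5, ("R", "R"): 6, ("K", "R"): 3,   # basic residues
    ("D", "D"): 4, ("E", "E"): 4, ("D", "E"): 3,   # acidic residues
    ("F", "F"): 9, ("Y", "Y"): 10, ("F", "Y"): 7,  # aromatic residues
}


def pair_score(x, y, scores, default=-1):
    """Look up a symmetric substitution score; unknown pairs get the default."""
    return scores.get((x, y), scores.get((y, x), default))


def ungapped_score(a, b, scores):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y, scores) for x, y in zip(a, b))


identities = sum(x == y for x, y in zip("KDF", "REY"))
print(identities, ungapped_score("KDF", "REY", TOY_SCORES))
```

The two segments share no identical residues, yet the conservative substitutions give a clearly positive score.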

16
Significance of score matrices

Substitution score matrices allow detection of
distant evolutionary relationships between
protein sequences
It is possible to detect much more distant
relationships by comparing protein sequences than
by comparing nucleic acid sequences

17
Part of the sequence of the human Huntington's
disease protein (huntingtin) showing low
complexity regions (underlined) associated with
compositional bias towards glutamine (Q) and
proline (P)

MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ
PPPPPPPPPP PQLPQPPPQA QPLLPQPQPP PPPPPPPPGP
AVAEEPLHRP KKELSATKKD RVNHCLTICE NIVAQSVRNS
PEFQKLLGIA MELFLLCSDD AESDVRMVAD ECLNKVIKAL
MSDNLPRLQL ELYKEIKKNG APRSLRAALW RFAELAHLVR
PQKCRPYLVN LLPCLTRTSK RPEESVQETL AAAVPKIMAS

18
A dot plot of the human pleckstrin sequence against
itself, produced with Erik Sonnhammer's Dotter
program. The sequence is plotted from N- to C-
terminus along horizontal and vertical axes
between residues 1 and approximately 350.
PLEK_HUMAN (horizontal) vs. PLEK_HUMAN (vertical)
19
The PAM250 matrix and an alignment of two sequences.
Total alignment scores obtained with two different
matrices should not be compared, but note that the
PAM matrix detects a much better alignment in the
second halves of these sequences than the identity
matrix does. With the introduction of a single gap,
sensible alignments of hydrophobic amino acids,
and alignment of K with R (both basic), D with E
(both acidic) and F with Y (both aromatic) can be
seen
[Figure: lower-triangular PAM250 substitution score matrix, with rows and columns in the order C S T P A G N D E Q H R K M I L V F Y W]
Sequence 1      MIIVKP VVLKGDFG
Sequence 2      MILLKP AIIIRAEY-
Position score  656256 044231370
20
Figure 3. Display of the DNA unit. DNA can be
described at several levels of detail. At the
most detailed level, DNA can be characterized by
the 5' and 3' termini at both external and
internal positions; at the most abstract level,
the substrate DNA can be one of 16 common
structures. The goal is to provide methods for
specifying the properties of DNA in as many ways
as is natural for a scientist.
21
Figure 7. An initial experimental environment.
The temperature is 37 degrees Celsius and the pH
value is 7.4. No DNA polymerase I activity is
possible
22
Part 2

Multiple Sequence Alignment

23
Non-specific sequence similarity

Certain types of sequence similarity are less
likely to be indicative of an evolutionary
relationship than others are
Examples of this are similarity between regions
of low compositional complexity, short period
repeats and protein sequences coding for generic
structures like coiled coils

24
Similarity search filters

Regions of these non-specific sequence types can
degrade the results of similarity searches and
are often filtered out of query sequences prior
to searching
The programs SEG and DUST can be used to detect
and filter low complexity sequences, XNU can
filter short period repeats and COILS can detect
the presence of potential coiled coil structures
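The idea of filtering low complexity regions can be illustrated with a sliding-window Shannon entropy mask. This is not the algorithm used by SEG or DUST; it is a simplified stand-in, and the window size, threshold and function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.5, mask_char="X"):
    """Replace residues in every low-entropy window with mask_char.

    window and threshold are illustrative values, not tuned settings.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Start of a glutamine-rich sequence: the poly-Q run is masked before searching.
print(mask_low_complexity("MATLEKLMKAFESLKSF" + "Q" * 20))
```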

25
Database types for searches

Database and query sequences can be protein or
nucleic acid sequences and different query
strategies are required for different types and
combinations
In general, searches are more sensitive using
strategies where protein-coding nucleic acid
database and/or query sequences are first
translated to protein sequences

26
Iterative database searches

PSI-BLAST is an iterative search method that
improves on the detection rate of BLAST and FASTA
Each iteration discovers intermediate sequences
that are used in a sequence profile to discover
more distant relatives of the query sequence in
subsequent iterations
Potential problems with PSI-BLAST are associated
with the potential for unrelated sequences to
pollute the iterative search, and difficulties
associated with the domain structure of proteins
PSI-BLAST can detect up to twice as many
evolutionary relationships as BLAST

27
Multiple sequence alignment

Multiple alignment illustrates relationships
between two or more sequences
When the sequences involved are diverse, the
conserved residues are often key residues
associated with maintenance of structural
stability or biological function
Multiple alignments can reveal many clues about
protein structure and functions

28
Multiple alignment
Part of an (artificial) multiple alignment of a
family consisting of 7 sequences, which subdivide
into 3 subfamilies. The bars on the left indicate
subfamilies; the dotted boxes highlight
conservation patterns.
29
Progressive sequence alignment

Most commonly used software uses the method of
progressive alignment
This is a fast method, but frozen-in errors mean
that it does not always work perfectly
Biological knowledge can provide information
about likely alignments, and where automatically
produced alignments turn out to be imperfect,
software for manual alignment editing is required

30
Protein families

Assigning sequences to protein families is a very
valuable way of predicting protein function
Many ways have been developed to represent
protein family information (consensus sequences,
conserved residues, residue patterns, sequence
profiles, etc.) and these have been
stored in secondary protein family databases

31
Consensus sequences

These condense the information from a multiple
alignment into a single sequence
Their main shortcoming is the inability to
represent any probabilistic information apart
from the most common residue at a particular
position
Derivation of consensus sequence illustrates that
any protein family representation is subject to
bias if the set of sequences from which it was
derived is biased
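Deriving a consensus sequence from a multiple alignment can be sketched as a column-by-column majority vote. The function name, the treatment of gaps (ignored when counting) and the tie-breaking rule (first residue seen wins) are assumptions made for this illustration.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most common residue in each column of an alignment of equal-length rows.

    Gaps are ignored when counting; a column of only gaps yields a gap.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


print(consensus(["MKV-L", "MRVAL", "MKIAL"]))
# Adding near-identical copies of one sequence shifts the consensus towards it.
print(consensus(["MKV-L", "MRVAL", "MRVAL", "MRVAL", "MKIAL"]))
```

The second call shows the bias problem: over-representing one subfamily changes the consensus, and the minority residues leave no trace in the result.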

32
PRINTS and BLOCKS

These represent protein families as sets of multiply
aligned ungapped segments (motifs) derived from
the most highly conserved regions of sequences
By representing more of the sequence, they have
the potential to be more sensitive than short
PROSITE patterns
The ability to match in only a subset of the
motifs associated with a particular family means
that they have the ability to detect splice
variants and sequence fragments and to represent
subfamilies
WWW-based search engines for the databases are
available

33
Protein domain families

Many proteins are built up from domains in a
modular architecture
The study of protein families is best pursued as
a study of protein domain families
ProDom is a database of protein domain sequences
created by automatic means from the protein
sequence databases

34
Resources for domain families

Pfam and SMART can be used for protein domain
family analysis
The integrated resource InterPro unites PROSITE,
PRINTS, Pfam, ProDom and SMART

35
Visualization of similarities

Dot plots are a very good way to visualize
sequence similarity and find repeats
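A text-mode dot plot takes only a few lines and shows why repeats stand out. This is a minimal sketch with exact word matching; the function name and window size are assumptions, and programs such as Dotter use scored windows instead of exact matches.

```python
def dot_plot(a, b, window=3):
    """Rows of a dot plot: a runs along the horizontal axis, b along the vertical.

    A '*' marks positions where the words of length `window` starting
    at that column of a and that row of b are identical.
    """
    rows = []
    for i in range(len(b) - window + 1):
        row = ""
        for j in range(len(a) - window + 1):
            row += "*" if b[i:i + window] == a[j:j + window] else "."
        rows.append(row)
    return rows


# A sequence plotted against itself: the main diagonal is the self-match,
# and the off-diagonal marks reveal the internal repeat.
for line in dot_plot("ABCABC", "ABCABC"):
    print(line)
```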
