1
Sequence alignment wrap up session
  • Program
    • More on alignment score statistics
    • Effects of compositionally biased sequence regions on alignment scores
    • BLAST heuristics for protein sequences
  • Exercise
    • Analyzing the sensitivity of BLAST with different parameter settings
    • Local alignments and BLAST searches with compositionally biased proteins

EPFL Bioinformatics I 14 Nov 2005
2
Alignment score statistics
Basic problem: What is the probability that the best local alignment
score between two random sequences exceeds a threshold score S?

Major finding: the random score distribution follows a so-called
extreme value distribution:

  $P(S < x) = \exp(-K N e^{-\lambda x})$

Here, N denotes the search space (the product of the query and database
sequence lengths). Based on this formula, and assuming a Poisson
distribution for multiple matches, the expected number of matches with
a score of at least S is computed as follows:

  $E(S) = K N e^{-\lambda S}$

For un-gapped alignments (also called high-scoring segment pairs or
HSPs) the parameters λ and K can be computed analytically from the
scoring matrix and the background residue frequencies (Karlin-Altschul
statistics). For gapped alignments, these parameters have to be
estimated by computer simulations.
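
As a rough numerical illustration of the two formulas above, the sketch below evaluates E(S) and the corresponding probability of seeing at least one match, 1 - exp(-E(S)). The function names and the search-space size are invented for this example. The values λ = 0.267 and K = 0.041 are assumed here as plausible parameters for a gapped BLOSUM62 scoring system and should be replaced by the parameters of the scoring system actually in use.

```python
import math


def expected_matches(score, search_space, lam, k):
    """Expected number of chance matches: E(S) = K * N * exp(-lambda * S)."""
    return k * search_space * math.exp(-lam * score)


def match_probability(score, search_space, lam, k):
    """Probability of at least one chance match with score >= S.

    Follows from the Poisson assumption: P = 1 - exp(-E(S)).
    """
    e_value = expected_matches(score, search_space, lam, k)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    LAMBDA, K = 0.267, 0.041   # assumed parameters (see the text above)
    N = 300 * 50_000_000       # hypothetical query length * database length
    for s in (50, 100, 150, 233):
        e = expected_matches(s, N, LAMBDA, K)
        p = match_probability(s, N, LAMBDA, K)
        print(f"S={s:4d}  E={e:.3g}  P={p:.3g}")
```

For small E the two quantities are nearly identical, which is why E-values below about 0.01 can be read as probabilities.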
3
Problems with compositionally biased sequence regions
Many proteins contain compositionally biased (low-complexity) regions,
for instance glutamine- or serine-rich regions. Such regions tend to
produce high-scoring matches even in the absence of true sequence
similarity, that is, similarity based on a similar ordering of
different residue types. Compositional matches usually do not reflect
phylogenetic relationship or 3D-structural similarity and are thus
considered uninteresting.

Shuffling tests for evaluating the significance of a match take this
effect into account by randomizing the order of the amino acids within
successive non-overlapping windows of the sequences.

One way to prevent compositional matches in database searches is to
mask low-complexity regions with place-holder characters for unknown
residues prior to the database similarity search.
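
The window-based shuffling described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular program: the function name, the default window length of 20 and the optional random generator argument are assumptions made for the example.

```python
import random


def window_shuffle(seq, window=20, rng=None):
    """Shuffle residues inside successive non-overlapping windows.

    Each window keeps exactly the residues it had (local composition is
    preserved), but their order within the window is randomized.
    """
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)


if __name__ == "__main__":
    fragment = "LEEQQRQQQQQQQQQQQQQQQQQQQQQQQQQAVAAAAVQQSTSQQATQGT"
    print(window_shuffle(fragment, window=10, rng=random.Random(1)))
```

Comparing the score of the real alignment with the scores obtained for many shuffled versions of one sequence gives an empirical estimate of significance. Because a glutamine-rich window stays glutamine-rich after shuffling, purely compositional matches keep scoring well against the shuffled sequences and are exposed as non-significant.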
4
Example of an alignment between compositionally biased regions
Query sequence: Human TATA-box binding protein TBP

>sp|Q8IZL2|MAML2_HUMAN (MAML2) Mastermind-like protein 2 (Mam-2). [Homo sapiens]
 Length = 1153

 Score = 94.4 bits (233), Expect = 5e-19
 Identities = 62/139 (44%), Positives = 71/139 (50%), Gaps = 21/139 (15%)

Query  16  SPQGAMTPGIPIFSPMMPYGTGLT----PQPIQNTNSLSILEEQQRQQQQQQQQQQQQQQ  71
           +P  AM P      P+  + +       P  + + N  S+L   Q+QQQQQQQQQQQQQQ
Sbjct 544  NPHPAMEPRQGNTKPLFHFNSDQANQQMPSVLPSQNKPSLLHYTQQQQQQQQQQQQQQQQ  603

Query  72  QQQQQQQQQQQQQQQ-----------------QQQQQQQQQAVAAAAVQQSTSQQATQGT  114
           QQQQQQQQQQQQQQQ                 QQQQQQQQQ       QQ   QQ  Q
Sbjct 604  QQQQQQQQQQQQQQQSSISAQQQQQQQSSISAQQQQQQQQQQQQQQQQQQQQQQQQQQQP  663

Query 115  SGQAPQLFHSQTLTTAPLP  133
           S Q  Q   SQ L  +PLP
Sbjct 664  SSQPAQSLPSQPLLRSPLP  682
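
The long glutamine runs that drive the match above are the kind of region that masking is meant to remove before a search. The sketch below flags such regions with a sliding-window Shannon entropy and replaces them with X. It is a deliberately simplified stand-in, not the algorithm of any real masking program, and the window length (12) and entropy cutoff (2.2 bits) are assumed values chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Fragment of the query segment shown in the alignment above.
    fragment = "PQPIQNTNSLSILEEQQRQQQQQQQQQQQQQQ"
    print(mask_low_complexity(fragment))
```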
5
Example of an alignment due to homology and structural similarity found with the same query
Query sequence: Human TATA-box binding protein TBP

>sp|O74045|TBP_CERSY (tbp) TATA-box binding protein (TATA-box factor)
 (TATA sequence-binding protein) (TBP) (Box A binding protein) (BAP).
 [Cenarchaeum symbiosum]
 Length = 182

 Score = 47.4 bits (111), Expect = 7e-05
 Identities = 27/71 (38%), Positives = 44/71 (61%), Gaps = 1/71 (1%)

Query 161  IVPQLQNIVSTVNLGCKLDLKTIALRARNAEYNPKRFAAVIMRIREPRTTALIFSSGKMV  220
           I P ++NIV+TV+ G  + +  I+ R   A Y+P  F  +I++  +   + L+F+SGKMV
Sbjct  98  IRPVVRNIVATVDAGRNVPIDRISSRMPGAVYDPGSFPGMILKGLD-SCSFLVFASGKMV  156

Query 221  CTGAKSEEQSR  231
             GAKS ++ R
Sbjct 157  IAGAKSPDELR  167
6
Different flavors of the Blast program
  • blastp: compares an amino acid query sequence against a protein
    sequence database.
  • blastn: compares a nucleotide query sequence against a nucleotide
    sequence database.
  • blastx: compares the six-frame conceptual translation products of a
    nucleotide query sequence (both strands) against a protein sequence
    database.
  • tblastn: compares a protein query sequence against a nucleotide
    sequence database dynamically translated in all six reading frames
    (both strands).
  • tblastx: compares the six-frame translations of a nucleotide query
    sequence against the six-frame translations of a nucleotide sequence
    database.
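
The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as below. This is an illustrative sketch assuming the standard genetic code and an uppercase or lowercase DNA string. The function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example and are not taken from BLAST itself.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from position 0; unknown codons give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGC").items():
        print(label, protein)
```

Stop codons appear as `*` in the output. Each of the six translations is then compared as a protein sequence.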
7
More on the BLAST heuristics for protein sequence similarity searches
1. Query compilation: For each position of the query sequence, compile
   a list of 3-letter words that match the query word at that position
   with a score of at least T (the score depends on the substitution
   matrix).
2. Word search: Search for two non-overlapping word matches on the same
   diagonal of the path matrix within a distance A of each other.
3. Un-gapped match extension: Extend the un-gapped alignment along the
   diagonal in both directions until the score drops more than X below
   the maximal score attained so far.
4. Gapped alignment extension: For un-gapped alignments (segment pairs)
   exceeding a threshold score Sg, trigger a gapped alignment of the
   region, extended until the score drops more than Xg below the
   maximal score attained so far.
5. Statistical evaluation of gapped alignments: BLAST uses hard-wired
   extreme value distribution parameters λ and K for specific scoring
   systems (substitution matrix + gap penalties).
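
Steps 2 and 3 above can be sketched in a few lines. This is a simplified illustration, not BLAST's implementation. It assumes that only identical words count as hits (step 1 would also admit non-identical words scoring at least T), that the distance A is measured between the start positions of two hits, and that the pair-scoring function is supplied by the caller. All function and parameter names are invented for the example.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Yield (i, j) where query[i:i+w] and subject[j:j+w] are identical."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            yield i, j


def two_hit_seeds(hits, w=3, max_dist=40):
    """Keep hits preceded by a non-overlapping hit on the same diagonal
    within max_dist (the distance A of step 2)."""
    last_on_diag = {}
    seeds = []
    for i, j in sorted(hits):
        diag = j - i
        prev = last_on_diag.get(diag)
        if prev is None or i - prev > max_dist:
            last_on_diag[diag] = i      # start a new pair on this diagonal
        elif i - prev >= w:             # non-overlapping and close enough
            seeds.append((i, j))
            last_on_diag[diag] = i
        # overlapping hits are ignored
    return seeds


def extend_ungapped(query, subject, i, j, w, score, x_drop):
    """Extend the hit at (i, j) along its diagonal in both directions.

    Each direction stops once the running score is more than x_drop below
    the best score seen so far (the X of step 3). Returns
    (best_score, query_start, query_end) with query_end exclusive.
    """
    best = run = sum(score(query[i + k], subject[j + k]) for k in range(w))
    start, end = i, i + w
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject):    # extend to the right
        run += score(query[qi], subject[sj])
        if run > best:
            best, end = run, qi + 1
        elif best - run > x_drop:
            break
        qi += 1
        sj += 1
    run = best
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:                      # extend to the left
        run += score(query[qi], subject[sj])
        if run > best:
            best, start = run, qi
        elif best - run > x_drop:
            break
        qi -= 1
        sj -= 1
    return best, start, end
```

With a substitution matrix lookup passed as `score`, the segments returned by `extend_ungapped` play the role of the high-scoring segment pairs that are tested against the threshold Sg in step 4.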
8
Examples of word match scores with the BLOSUM62 matrix
WWW ->  33 <- WWW
YYY ->  21 <- YYY
GGG ->  18 <- GGG
QQQ ->  15 <- QQQ
SSS ->  12 <- SSS
HSW ->  20 <- HAW
LFY ->  12 <- IYY
EWD ->  15 <- DWE
KFI ->   8 <- RYV
EWC -> -11 <- FDE
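
A word match score is the sum of the substitution scores of the three aligned residue pairs. The sketch below recomputes such scores from a small subset of BLOSUM62 that contains only the residue pairs needed for the words listed above. The dictionary layout and function names are choices made for this example, and the threshold used in `is_word_match` is an assumed value for T.

```python
# Subset of BLOSUM62 (keys are alphabetically sorted residue pairs).
PAIR_SCORES = {
    ("W", "W"): 11, ("Y", "Y"): 7, ("G", "G"): 6, ("Q", "Q"): 5,
    ("S", "S"): 4, ("H", "H"): 8, ("A", "S"): 1, ("I", "L"): 2,
    ("F", "Y"): 3, ("D", "E"): 2, ("K", "R"): 2, ("I", "V"): 3,
    ("E", "F"): -3, ("D", "W"): -4, ("C", "E"): -4,
}


def pair_score(a, b):
    """Look up the symmetric substitution score of two residues."""
    return PAIR_SCORES[tuple(sorted((a, b)))]


def word_score(word1, word2):
    """Sum the pair scores of two equally long words, position by position."""
    if len(word1) != len(word2):
        raise ValueError("words must have the same length")
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


def is_word_match(word1, word2, threshold=11):
    """True if the word pair scores at least the threshold T (assumed 11)."""
    return word_score(word1, word2) >= threshold


if __name__ == "__main__":
    for w1, w2 in [("WWW", "WWW"), ("HSW", "HAW"), ("KFI", "RYV"), ("EWC", "FDE")]:
        print(w1, w2, word_score(w1, w2), is_word_match(w1, w2))
```

The examples show why words rich in rare residues such as W match with high scores, while words made of common residues need identical or near-identical partners to reach the threshold.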