Sequence Databases - PowerPoint PPT Presentation

1 / 30
About This Presentation
Title:

Sequence Databases

Description:

These algorithms for finding optimal alignment takes time proportional to the ... BLAST tries to extend an alignment from the matching words in both directions, ... – PowerPoint PPT presentation

Number of Views:40
Avg rating:3.0/5.0
Slides: 31
Provided by: PCAdmini7
Category:

less

Transcript and Presenter's Notes

Title: Sequence Databases


1
Sequence Databases
  • As DNA and protein sequences accumulate, they are
    deposited in public databases.
  • One of the most popular of these is GenBank,
    which stores all types of different sequences in
    plain text files, where different features are
    separated by blank spaces and special symbols.
  • Each file can be downloaded directly. Sample
    file
  • LOCUS name of locusSOURCE source organism of
    DNAFEATURES information about sequence by base
    positionORIGIN 1 gaattcgata aatctctggg
    ttattgtgac gtttataatg acgttaggca 51
    atatcattct atcattaagc

2
Relational Databases
  • A more structured way to store data is by using
    relational databases. A database is made up of a
    set of tables which stores data in a
    non-redundant manner to facilitate data updates.
  • student uin firstname lastname 123456789 Alice
    Smith 333445555 Tom Wang
  • major uin department 123456789 BICH 12345678
    9 CPSC 333445555 BICH

3
Query Languages
  • Instead of allowing users to look at the contents
    of a database directly, access is through the use
    of a query language over a connection to the
    database.
  • SQL is one of the most popular query languages.
    The following is a sample SQL query
  • select firstname, lastname from student,
    major where student.uin major.uin and
    major.department BICH
  • This approach imposes a structure to the data and
    allows updates to be seen instantaneously.

4
Graphical User Interface
  • Fortunately, there is no need to learn a query
    language in order to be able to access a
    relational database.
  • Modern database software packages provide
    graphical user interfaces developed on top of the
    query language level, so that users dont need to
    know anything about query languages.
  • Relational databases are very popular within
    biotech companies and web interfaces are becoming
    standard.

5
Comparing Sequences
  • Mutation in DNA sequences is modeled by three
    basic operations insertion of a nucleotide,
    deletion of a nucleotide and substitution of a
    nucleotide.
  • Mutation is tolerated to a certain extent before
    the function of a DNA sequence is lost or
    changed. Thus, similarity between sequences
    reflects similarity in function.
  • As DNA sequences of different organisms
    accumulate, comparing a newly discovered sequence
    to known sequences helps to identify the function
    of the new sequence.

6
Pairwise Alignment
  • Given two (DNA or protein) sequences, an
    alignment puts the two strings against each other
    so that similar parts are aligned together.
  • For example, given two strings ATCTCGAT and
    TGCATAT, one alignment is
    AT-C-TCGCT -TGCAT--ATwhere -
    denotes a gap introduced to pad non-similar
    parts.
  • Given a scoring scheme, the pairwise alignment
    problem is the problem of finding the best
    alignment with optimal score. A special form of
    alignment called local alignment allows
    alignments to not cover entire sequences.

7
Finding Optimal Alignment
  • Dynamic programming techniques are used to find
    the optimal alignment (which use a
    divide-and-conquer approach to express the
    original problem as smaller subproblems).
  • These algorithms for finding optimal alignment
    takes time proportional to the product of the
    length of input sequences.
  • In large scale comparisons, heuristics are
    commonly employed to find good alignment (which
    is not necessarily optimal) between two sequences.

8
Database Search
  • As the amount of sequenced DNA increases, it is
    likely that a new DNA sequence has similarity to
    at least one of the known sequences.
  • New DNA sequences are annotated as much as
    possible and collected in sequence databases such
    as GENBANK.
  • When a new DNA sequence is obtained, the first
    logical step is to try to search for similar
    sequences in a database of known sequences.
  • If similar sequences are found from database
    search, sequence similarity would imply
    similarity in function.

9
Filtration Idea
  • Given a database of sequences s1, , sn and a
    sequence s, the first step is to determine which
    of the pairs (s, si) are likely to be similar.
  • There are conflicting objectives
  • Dont want to lose real similarities.
  • Want to cut down the number of possible
    similarities.
  • The simplest way to perform such a filtration is
    to consider a k-mer appearing in s and check if
    it appears in each of si.

s1
s2
s
s3
s4
10
Seed-Extension Approach
  • The next step is to extend the short similarities
    (seeds) we found in the filtration step to see if
    the gaps between the short similar areas can be
    filled.
  • All hits with strong similarity found are sorted
    in decreasing order of score.
  • It is possible to compute the optimal alignment
    in detail for a limited subset of high scoring
    hits.

11
BLAST
  • BLAST is one of the most popular programs people
    use to perform database search.
  • There are various variations of the basic
    programs to compare DNA against DNA, DNA against
    protein, protein against protein, etc.
  • In addition to employing the filtration-extension
    approach, BLAST employs an additional set of
    heuristics to optimize its performance.

12
BLAST Seed Selection
  • BLAST uses different seed lengths for different
    types of sequences
  • use words of length 11 for DNA sequences.
  • use words of length 3 for protein sequences.
  • perform a translation from DNA to protein for
    comparisons involving DNA and proteins. All
    three reading frames are investigated.
  • For protein sequences, in addition to look for
    exact matches, BLAST also looks for approximate
    matches with score above a preset threshold.

13
BLAST Alignment Extension
  • BLAST tries to extend an alignment from the
    matching words in both directions, and continue
    the extension as long as the score continues to
    increase. If this results in a gap between two
    matching words being filled, the extension is
    successful.




s



si
14
Multiple Alignment
  • It is frequently very difficult to reveal weak
    similarities between two sequences.
  • These similarities can often be captured by
    comparing multiple sequences.
  • Multiple alignment generalizes the alignment idea
    to handle many sequences.
  • AT-C-TCGAT
    -TGCAT--AT ATCCA-CGCT

15
ClustalW
  • ClustalW is a very popular software for multiple
    alignment.
  • It uses the progressive alignment idea treat
    each sequence as an alignment initially and
    iteratively select the next two most similar
    alignments to be combined into a bigger alignment
    by pairwise alignment techniques.
  • The order to combine the alignments is determined
    by following an evolutionary tree generated on
    the fly between the sequences.

?

16
TCoffee
  • When computing an initial pairwise alignment,
    TCoffee incorporates consistency information from
    other sequences to improve its consistency within
    the final multiple alignment
  • It accepts different kinds of pairwise alignments
    as input, and constructs a weighted library to
    represent the pairwise relationships.
  • For each sequence pair A and B, it extends the
    original library by trying to align A and B
    through each sequence C in the input set and
    adding consistency information to the (A,B) pair.
  • TCoffee is much more accurate than ClustalW.

17
Gene Prediction
  • In eukaryotes, genes are made up of exons and
    introns. Introns are cut out of the RNA and
    exons are concatenated together to form the mRNA
    before translation into a protein.
  • The gene prediction problem is the following
    problem Given a genomic sequence, locate the
    exon-intron boundaries.

exon
exon
exon
intron
intron
gene (RNA)
mRNA
18
GENSCAN
  • GENSCAN is one of the most popular software
    programs used for gene structure prediction.
  • It integrates multiple types of information in a
    hidden markov model (HMM), including promoter
    signals, splicing signals, length distributions
    for exons and introns, poly-A transcriptional end
    signal, etc.
  • A HMM consists of a set of states and transition
    probabilities between the states. Each state
    represents a modeled feature.
  • Although the structure of the HMM is
    pre-determined, it is necessary to provide
    estimates of other parameters (such as transition
    probabilities) via a set of training examples.

19
Similarity-based Approaches
  • As the amount of sequenced DNA increases, it is
    likely that a new (unspliced) genomic sequence
    has similarity to a known cDNA or protein.
  • BLAST can be used to find similar sequences from
    the database of known sequences. These BLAST
    hits reveal only the most significant local
    alignments between the genomic sequence and a
    similar sequence, which usually represent one or
    few exons.
  • The GenomeScan software integrates this
    information with the HMM model of GENSCAN.

20
Regulation of Gene Expression
  • Gene regulatory proteins bind to specific places
    (regulatory sites) on DNA. These sites are
    usually close to the gene.

off
gene
site
regulatory protein
on
gene
site
21
Finding Regulatory Sites
  • Identify a set of genes believed to be controlled
    by the same regulatory mechanism (co-regulated
    genes).
  • Extract regulatory regions of the genes (usually
    upstream sequences) to form a sample of
    sequences.
  • Find some way to identify conserved patterns
    shared by these sequences, resulting in a list of
    potential regulatory sites.
  • The problem of finding patterns shared by a given
    sample of sequences is called the motif finding
    problem.

22
Motif Finding Problem
sample
gene
site
gene
site
gene
site
gene
site
gene
site
23
MEME
  • MEME is one of the most popular programs for
    motif finding. It finds a motif, represented by
    a set of ungapped patterns, which is the least
    likely to appear by chance.
  • It uses the expectation-maximization (EM)
    approach first obtain an initial motif (which
    may not be very good), then iteratively obtain a
    better motif with the following two steps
  • Expectation compute the statistical composition
    of the current motif and find the probability of
    finding the site at each position in each
    sequence.
  • Maximization These probabilities are used to
    update the statistical composition.

24
Finding Regulatory Sites by Alignment
  • Recall that one way to find regulatory sites is
    to construct samples of upstream regions of
    co-regulated genes and use motif finding
    algorithms to look for shared patterns.
  • These samples of co-regulated genes are
    constructed from diverse sources. Regulatory
    sites can reside on either strand and they are
    not at a fixed distance from the transcriptional
    start site.
  • For these samples, alignment algorithms are not
    useful. However, alignment approaches can be
    very useful when we specifically consider
    co-regulated genes which are very close in
    evolutionary distance.

25
What to Look For
  • There are frequently more than one regulatory
    site for a gene. When sufficiently close
    upstream regions of co-regulated genes are
    aligned, we expect the sites to be short, well
    conserved, in the same order and on the same
    strand. These short blocks of highly conserved
    positions are separated by less conserved
    regions.
  • Although evolutionarily close genes should be
    used, they should not be so close that the entire
    upstream region is highly conserved, which does
    not give much information.

26
Using Closely Related Yeast Species
  • Align upstream sequences from a few Saccharomyces
    species to try to identify regulatory elements
  • Many alignments of S.mikatae and S.bayanus
    promoter regions with S.cerevisiae show short
    runs of conserved sequences, which predicts
    regulatory sites.
  • The two species S.cariocanus and S.paradoxus are
    too closely related to S.cerevisiae. Not much
    information is revealed in their alignments.

27
Inferring Evolution
  • Using estimates on the evolutionary distance
    between multiple organisms, an evolutionary tree
    can be constructed predicting the evolutionary
    relationships.
  • It is possible that conflicting trees are
    produced from different data sources. We are
    much more confident that a tree is correct when
    the same tree is produced from various data
    sources.

mouse
rat
human
28
Evolutionary Distance Estimates
  • Estimates of evolutionary distance
  • Each multiple alignment gives distance estimates
    between species. To obtain more accurate
    estimates, a large set of multiple alignments
    including different regions or genes should be
    utilized collectively.

29
Using Evolutionary Information in Alignment
  • Progressive multiple alignment algorithms often
    construct an evolutionary tree on the fly to
    determine the order in which smaller alignments
    are merged to form bigger ones.
  • If the sequences to be aligned are from different
    species, it is also possible to utilize a
    pre-determined evolutionary tree.
  • Since an evolutionary tree can be constructed
    given a multiple alignment, it is possible to
    iteratively improve both the evolutionary tree
    and the multiple alignment from crude initial
    estimates.

30
Using Evolutionary Information in Motif Finding
  • The motif finding problem is to find a pattern
    shared by a given set of sequences. In the case
    when the sequences are from different species,
    evolutionary information can be utilized to get
    more accurate results.
  • Given an evolutionary tree of the represented
    species, a better way to score a motif is to
    compute the total number of mutations along the
    branches of the tree. It has been shown that
    this approach produces better motifs.
Write a Comment
User Comments (0)
About PowerShow.com