Waiting Time and Seed Selection for Homology Search

Slides: 28
Provided by: LBI74


1
Waiting Time and Seed Selection for Homology
Search
  • Gary Benson
  • Department of Computer Science
  • Department of Biology
  • Graduate Program in Bioinformatics
  • Boston University

2
Classroom Experiment: Estimating Waiting Time
  • Each student tosses a coin until the first
    occurrence of 3 heads in a row. This is one
    trial. If ten tosses do not produce three heads
    in a row, stop and begin a new trial.
  • Repeat trials, if necessary, so that there are
    100 trials across all the students in the class.
  • For each trial, record in a table which toss (if
    any) gives the first occurrence of three heads in
    a row.
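
The experiment above can also be run as a simulation. The following is a minimal sketch in Python, assuming a fair coin; the function names `first_run` and `simulate_class` are made up for this illustration and are not part of the talk.

```python
import random

def first_run(tosses, k=3):
    """Return the 1-based toss that completes the first run of k heads, or None."""
    run = 0
    for i, toss in enumerate(tosses, start=1):
        run = run + 1 if toss == "H" else 0
        if run == k:
            return i
    return None

def simulate_class(trials=100, max_tosses=10, k=3, p_heads=0.5, seed=None):
    """Tally, over many trials, the toss of the first run of k heads.

    The key None counts trials with no run within max_tosses.
    All max_tosses tosses are generated up front; this does not change
    where the first run occurs.
    """
    rng = random.Random(seed)
    table = {}
    for _ in range(trials):
        tosses = ["H" if rng.random() < p_heads else "T" for _ in range(max_tosses)]
        result = first_run(tosses, k)
        table[result] = table.get(result, 0) + 1
    return table
```

Calling `simulate_class(seed=1)` returns a table of counts for 100 simulated trials, which plays the role of the class table.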

3
Example
  • Toss      1  2  3  4  5  6  7  8  9  10   First occurrence
  • Trial 1   H  T  H  H  T  H  H  H          8
  • Trial 2   T  T  H  H  H                   5
  • Trial 3   H  T  T  H  H  T  H  H  H       9
  • Trial 4   H  T  T  H  H  T  H  T  H  H    None

4
Fill in Table
5
Class Results
6
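Molecular evolution
  • [Figure: over evolutionary time, the ancestral sequence ACGTTCGACT
    accumulates mutations along two lineages, passing through intermediates
    such as AAGTTCGACT and ACGTTCGGCT, and ending in the human sequence
    AAGTTCGACA and the mouse sequence ACCTTCGGCT.]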
7
Biologists Ask
  • I have an unstudied human gene sequence:
  • AAGTTCGACA
  • What other genes (from other organisms) are similar to this one?

8
Why do they ask that question?
  • Genes that have similar sequences often:
  • produce proteins that have similar functions
  • are regulated in similar ways
  • So a gene that has already been studied can often
    provide information about an unknown gene.

9
How we answer the question
  • Compare the human gene to every sequence in a database of known
    genomes.

10
How BIG is the database?
  • Combined length of all sequences is greater than
    10^11 = 100,000,000,000 nucleotides.

11
http://www.ncbi.nlm.nih.gov/
12
http://www.ncbi.nlm.nih.gov/
14
How long does it take to search?
  • The best algorithm (not the fastest) for
    detecting similarity takes time proportional to
    the product of the lengths of the sequence and
    the database.
  • Assume the unstudied gene sequence is 10^2 = 100 nucleotides and the
    database is 10^11 = 100,000,000,000 nucleotides.

16
Searching the database
  • 10^2 × 10^11 = 10^13 nucleotide comparisons.
  • Good desktop computers now run at 3 gigahertz, that is, 3 billion =
    3 × 10^9 comparisons per second.
  • So comparing one sequence to the database takes
  • 10^13 / (3 × 10^9) ≈ 3333 seconds ≈ 55 minutes
  • UGH!!!
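
The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; it takes on the slide's simplifying assumption that one nucleotide comparison costs one clock cycle, and the variable names are chosen here for illustration.

```python
query_len = 10**2    # length of the unstudied gene sequence, in nucleotides
db_len = 10**11      # combined length of the database, in nucleotides

# Time proportional to the product of the two lengths.
comparisons = query_len * db_len

# Assumption from the slide: 3 GHz means 3 * 10^9 comparisons per second.
rate = 3 * 10**9

seconds = comparisons / rate
minutes = seconds / 60
print(f"{comparisons:.0e} comparisons, {seconds:.0f} seconds, {minutes:.1f} minutes")
```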

17
A faster way
  • Take small words from the query sequence (A) and find matches in the
    database first:
  • AAGTTCGACA
  • AAGTT AGTTC GTTCG CGACA
  • Then only do comparisons where there is a hit.
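
Extracting the words can be sketched as follows. This is a minimal illustration in Python; the function name `words` and the default word length of 5 are assumptions made for this example.

```python
def words(seq, w=5):
    """Return every length-w substring (word) of seq, in left-to-right order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

print(words("AAGTTCGACA"))
```

For AAGTTCGACA and w = 5 this yields six words: AAGTT, AGTTC, GTTCG, TTCGA, TCGAC and CGACA. The slide lists four of them.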

19
Example
  • Sequence AAGTTCGACA
  • Targets AAGTT AGTTC GTTCG CGACA
  • Database
  • ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCC
    TGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCC
    GACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGG
    CTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC

But we missed the mouse gene!
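
The example can be reproduced with a short script. This is a sketch, not the method used by any particular search tool: it uses plain exact substring search, the name `find_hits` is made up here, and it assumes that the mouse gene is the sequence ACCTTCGGCT that is compared with the human gene elsewhere in the talk.

```python
DATABASE = (
    "ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCC"
    "TGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCC"
    "GACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGG"
    "CTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC"
)
TARGETS = ["AAGTT", "AGTTC", "GTTCG", "CGACA"]
MOUSE = "ACCTTCGGCT"  # assumption: the mouse gene from the evolution example

def find_hits(database, targets):
    """Map each target word to the 0-based positions where it occurs exactly."""
    hits = {}
    for word in targets:
        positions = []
        start = database.find(word)
        while start != -1:
            positions.append(start)
            start = database.find(word, start + 1)
        hits[word] = positions
    return hits

hits = find_hits(DATABASE, TARGETS)
for word, positions in hits.items():
    print(word, positions)

# The mouse gene is present in the database, yet it contains none of the
# target words, so no hit points to it.
print("mouse gene in database:", MOUSE in DATABASE)
print("mouse gene contains a target:", any(word in MOUSE for word in TARGETS))
```

Three of the four target words occur somewhere in the database, but none of them occurs inside the mouse gene, which is why the word-based search passes over it.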
20
What size should the words be?
  • Too long and we miss similar sequences.
  • This reduces sensitivity.
  • Too short and there are hits everywhere.
  • This reduces specificity.
  • We need to balance these two issues.
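
To get a rough feel for the specificity side, one can estimate how many chance hits a single word would produce. The sketch below is only an illustration under an assumption that is not in the talk: the database is treated as uniformly random, with each of the four nucleotides equally likely at every position. The function name `expected_chance_hits` is hypothetical.

```python
def expected_chance_hits(word_len, db_len=10**11):
    """Expected number of exact chance matches of one word in a random database.

    Assumes independent, uniformly distributed nucleotides, so a given
    word matches at any one position with probability (1/4) ** word_len.
    """
    return db_len / 4**word_len

for w in (5, 10, 15, 20):
    print(w, f"{expected_chance_hits(w):.3g}")
```

Under this model, each extra letter in the word cuts the expected number of chance hits by a factor of four.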

21
Computing sensitivity
  • We need to answer the following type of question:
  • Suppose a gene sequence of length 100 has a match
    in the database and the two sequences are
    expected to be identical at 80% of their
    positions. If we search with words of length 5,
    how often will we find the match?

23
Computing sensitivity
  • We model the two sequences with coin tosses,
    where a match is considered a head and a mismatch
    (mutation) is considered a tail.
  • Example
  • AAGTTCGACA
  • ACCTTCGGCT
  • HTTHHHHTHT

We are interested in at least one occurrence of five heads in a row when we
use small words of length five.
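
The coin-toss model of the example pair can be written out in a few lines. This is a minimal sketch in Python; the helper names `match_string` and `longest_head_run` are made up for this illustration, and the two sequences are assumed to have equal length.

```python
def match_string(a, b):
    """Encode two aligned sequences as coin tosses: H = match, T = mismatch."""
    return "".join("H" if x == y else "T" for x, y in zip(a, b))

def longest_head_run(tosses):
    """Length of the longest run of consecutive heads."""
    best = run = 0
    for toss in tosses:
        run = run + 1 if toss == "H" else 0
        best = max(best, run)
    return best

tosses = match_string("AAGTTCGACA", "ACCTTCGGCT")
print(tosses, longest_head_run(tosses))
```

For the example pair the encoding is HTTHHHHTHT, whose longest run of heads has length 4, so no word of length five matches.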
24
Waiting Time
  • Our question is a classic waiting time problem.
    It asks, for coin toss sequences of length n, and
    with probability of heads P(H), what is the
    probability of tossing k heads in a row at least
    once?
  • For our question,
  • n = 100,
  • P(H) = 0.8
  • k = 5
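
One way to compute this probability is a small dynamic program over the length of the current run of heads. The sketch below is one possible approach in Python, not necessarily the method used in the talk; the function name `prob_run` is made up, and tosses are assumed to be independent.

```python
def prob_run(n, p, k):
    """Probability of at least one run of k heads in n tosses, with P(H) = p."""
    # state[j]: probability that no run of k heads has occurred so far
    # and the tosses seen so far end in exactly j consecutive heads.
    state = [1.0] + [0.0] * (k - 1)
    found = 0.0
    for _ in range(n):
        found += p * state[k - 1]            # a head completes a run of k
        new_state = [0.0] * k
        new_state[0] = (1 - p) * sum(state)  # a tail resets the run
        for j in range(k - 1):
            new_state[j + 1] = p * state[j]  # a head extends the run
        state = new_state
    return found

# Miss rate (no run of 5 heads) for a few sequence lengths.
for n in (100, 23, 9):
    print(n, f"{1 - prob_run(n, 0.8, 5):.4f}")
```

With n = 100, P(H) = 0.8 and k = 5 the probability of at least one run comes out above 99.99%, and the miss rate at n = 9 is about 41%.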

25
Answer
  • 99.99%, which means that using small words of length 5, we will almost
    never miss similar sequences of length 100.
  • In fact, the sequences have to be as short as 23 nucleotides before we
    miss 5% of similarities.
  • And at 9 nucleotides we miss 41% of similarities.
  • AAGTTCGACA
  • ACCTTCGGCT

27
But wait! Maybe the words can be modified
  • It turns out that words composed of letters with
    "don't care" spaces produce better results.
  • AAGTTCGACA
  • AATTC AGTCG GTCGA TCACA
  • Current research involves finding the best word
    shapes.
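
The words on this slide can be reproduced with a spaced pattern. The sketch below is an illustration in Python; the pattern `110111` (a "don't care" space in the third position) is inferred here from the example words and is not stated in the talk, and the function name `spaced_words` is made up.

```python
def spaced_words(seq, pattern="110111"):
    """Extract words with a spaced pattern: '1' keeps a letter, '0' skips it."""
    span = len(pattern)
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]

print(spaced_words("AAGTTCGACA"))
```

For AAGTTCGACA this yields AATTC, AGTCG, GTCGA, TTGAC and TCACA; the slide lists four of them.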