Title: Waiting Time and Seed Selection for Homology Search
1Waiting Time and Seed Selection for Homology
Search
- Gary Benson
- Department of Computer Science
- Department of Biology
- Graduate Program in Bioinformatics
- Boston University
2Classroom Experiment Estimating Waiting Time
- Each student tosses a coin until the first
occurrence of 3 heads in a row. This is one
trial. If ten tosses do not produce three heads
in a row, stop and begin a new trial. - Repeat trials, if necessary, so that there are
100 trials across all the students in the class.
- For each trial, record in a table which toss (if
any) gives the first occurrence of three heads in
a row.
3Example
- 1 2 3 4 5 6 7 8 9 10
- Trial 1 H T H H T H H H 8
- Trial 2 T T H H H 5
- Trial 3 H T T H H T H H H 9
- Trial 4 H T T H H T H T H H None
First occurrence
4Fill in Table
5Class Results
6ACGTTCGACT
ACGTTCGACT
ACGTTCGGCT
ACGTTCGACT
AAGTTCGACT
ACGTTCGGCT
Evolutionary time
AAGTTCGACT
ACCTTCGGCT
AAGTTCGACA
ACCTTCGGCT
Mouse
Human
Molecular evolution
7Biologists Ask
- I have an unstudied human gene sequence
- AAGTTCGACA
- What other genes (from other organisms)
- are similar to this one?
8Why do they ask that question?
- Genes that have similar sequences often
- produce proteins that have similar function
- are regulated in similar ways
- So a gene that has already been studied can often
provide information about an unknown gene.
9How we answer the question
- Compare the human gene to
-
- every sequence
-
- in a database of known genomes
10How BIG is the database?
- Combined length of all sequences
- is greater than
- 1011 100,000,000,000 nucleotides.
11http//www.ncbi.nlm.nih.gov/
12http//www.ncbi.nlm.nih.gov/
13(No Transcript)
14How long does it take to search?
- The best algorithm (not the fastest) for
detecting similarity takes time proportional to
the product of the lengths of the sequence and
the database. - Assume the unstudied gene sequence is
- 102 100 nucleotides and the database is
- 1011 100,000,000,000 nucleotides.
15Searching the database
- 102 1011 1013 nucleotide comparisons.
- Good desktop computers now run at
- 3 gigahertz 3 billion 3 109 comparisons per
second - So comparing one sequence to the database takes
- 1013 / 3 109 3333 seconds 55 minutes
16Searching the database
- 102 1011 1013 nucleotide comparisons.
- Good desktop computers now run at
- 3 gigahertz 3 billion 3 109 comparisons per
second - So comparing one sequence to the database takes
- 1013 / 3 109 3333 seconds 55 minutes
- UGH!!!
17A faster way
- Take small words from A and find matches in the
database first - AAGTTCGACA
- AAGTT AGTTC GTTCG CGACA
- Then only do comparisons where there is a hit.
18Example
- Sequence AAGTTCGACA
- Targets AAGTT AGTTC GTTCG CGACA
- Database
- ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCC
TGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCC
GACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGG
CTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC
19Example
- Sequence AAGTTCGACA
- Targets AAGTT AGTTC GTTCG CGACA
- Database
- ACTGGTTCAAATGGCGCATGCAAAAGTTGGCTGATTTGCATGACGTACCC
TGAGACCTCGGAATTCTAGCTTGCGAAGTAATACGATACCGTACGTTGCC
GACATACGGTACGTCGTCTACGTACGTACGCCTACGCTACGTACCTTCGG
CTTTTCATGGCAGCGATCGTACTCCTCTAGTTCCTGACTGACTAC
But we missed the mouse gene!
20What size should the words be?
- Too long and we miss similar sequences
- This reduces sensitivity.
- Too small and there are hits everywhere.
- This reduces specificity.
- We need to balance these two issues.
21Computing sensitivity
- We need to answer the following type of question
- Suppose a gene sequence of length 100 has a match
in the database and the two sequences are
expected to be identical at 80 of their
positions. If we search with words of length 5,
how often will we find the match?
22Computing sensitivity
- We model the two sequences with coin tosses,
where a match is considered a head and a mismatch
(mutation) is considered a tail. - Example
- AAGTTCGACA
- ACCTTCGGCT
- HTTHHHHTHT
23Computing sensitivity
- We model the two sequences with coin tosses,
where a match is considered a head and a mismatch
(mutation) is considered a tail. - Example
- AAGTTCGACA
- ACCTTCGGCT
- HTTHHHHTHT
We are interested in at least one occurrence of
five heads if we use small words of length five.
24Waiting Time
- Our question is a classic waiting time problem.
It asks, for coin toss sequences of length n, and
with probability of heads P(H), what is the
probability of tossing k heads in a row at least
once? - For our question,
- n 100,
- P(H) 0.8
- k 5
25Answer
- 99.99 -- Which means that using small words of
length 5, we will never miss similar sequences
of length 100. - In fact, the sequences have to be as short as 23
nucleotides before we will miss 5. - And at 9 nucleotides we miss 41 of similarities.
- AAGTTCGACA
- ACCTTCGGCT
26But wait! Maybe the words can be modified
- It turns out, that words composed of letters with
dont care spaces produce better results. - AAGTTCGACA
- AATTC AGTCG GTCGA TCACA
27But wait! Maybe the words can be modified
- It turns out, that words composed of letters with
dont care spaces produce better results. - AAGTTCGACA
- AATTC AGTCG GTCGA TCACA
- Current research involves finding the best word
shapes.