Title: 72x48 Poster Template
1SSAHA Search with Speed Nick Altemose, Kelvin
Gu, Tiffany Lin, Kevin Tao, Owen Astrachan Duke
University Focus Program The Genome Revolution
and Its Impact on Society
How Does a Hash Table Work?
Introduction
APT k-tuple Coordinates in a Hash Table
SSAHA Step-by-Step
- The Human Genome contains an immense amount of
information 3 Billion base pairs, 3 Gigabytes of
storage space. In fact, if you were to read it
aloud continuously at the alarming rate of 10
base pairs per second, it would take you 9.5
years to get through it all. - When you want to search for a sequence in the
genome, you must use smart search methods like
those used by internet search engines. If you
were to search through the entire genome one base
pair at a time, it could take hours to get back
any results, even with the best computers. - SSAHA is a search method that was developed with
one goal in mind speed. - SSAHA takes a database of data, like the 3
Gigabase human genome reference sequence, and
organizes it into a fast searchable index, known
as a hash table. - This organization is the key to SSAHAs speed,
and the origin of its name Sequence Search and
Alignment by Hashing Algorithm. - The hash table is like a dictionary to the
English language, or an index to an encyclopedia.
It allows for fast lookup of organized
information. After making this hash table, it is
then very easy to search for a given sequence.
- Problem Statement
- In this APT, you will generate a string similar
to a single hash table entry. Given an array of
DNA strings (our database) and a dimer string
(our 2-tuple), output a string containing the
positions of every occurrence of that 2-tuple in
the database (not limited to a single reading
frame). Positions will be in the form of
(0-initiated array element index, 0-initiated
character index) and should be ordered ascending
numerically first by array element index, then by
character index. Note that these positions will
be 0-initiated and not 1-initiated like actual
SSAHA positions. - Notes
- Positions should be space-delimited in the
string. For example, an output string containing
5 positions should have this format - (0,2) (1,3) (2,5) (5,3) (6,7)
- Constraints
- database can contain strings of any length,
including 0 and 1. - All input strings will contain only the
characters A, G, C,or T in any
combination. - twotuple will always have exactly 2 characters.
- Examples
- 1. ATTCGT, TACGG, GAATCA, G
- CG
- Returns (0, 3) (1, 2)
- 2. TTCGT, AAATC, ATATATTTA, TCGAAATG
- AT
- Returns (1, 2) (2, 0) (2, 2) (2, 4)
(3, 6) - 3. TTTTT, ATTCG
- TT
- Returns (0,0) (0, 1) (0, 2) (0, 3)
(1, 1)
1) Break sequences in database into
k-tuples. 2) Make a hash table. Ideally,
each cell in the hash table is allotted for one
particular unique k-tuple. Each cell
contains the locations where that particular
unique k-tuple occurs. The location specifies the
sequence where the k-tuple occurs, and the index
in the sequence where the k-tuple occurs. 3) The
query sequence is also broken into
k-tuples. 4) These query k-tuples are
matched to k-tuples in the database. Matches are
noted. 5) SSAHA returns search results,
listing sequences from the database that best
match the query sequence. The degree of
similarity is based on the amount of matching
k-tuples, the successiveness in which they match
(this involves pathingnot explained here) etc.
and attempts to account for insertions, deletions
and SNPs.
- The hash table stores positions at which
different short sequences, called k-tuples, occur
in the database. A position looks like this
(index, offset), where index identifies which DNA
strand contains the k-tuple, and offset
identifies at what position in the DNA strand the
k-tuple begins. - In lay terms, SSAHA likes to keep things
organized, like an anal retentive woman who puts
everything in labeled Tupperware, which she then
organizes into labeled drawers. The index is like
the labeled drawer, and the offset is like the
labeled Tupperware. If you tell her to get
something from the third drawer, fourth
Tupperware, she can quickly and easily locate
what youre looking for. Or, if you ask her where
you can find something, she can easily tell you
in which drawer and Tupperware youll find it. - You only have to make a hash table once for a
given database, kind of like you only have to
tidy your room up once, then you can find things
extremely easily thereafter.
Visual of Hash Table Search
How Does SSAHA Code for Base Pairs?
- SSAHA stores individual nucleotide base
information as 2-bits of information per base - A is stored as 00
- C is stored as 01
- G is stored as 10
- T is stored as 11
- This allows for fast and efficient storage and
processing, using less memory and less processor
power - However, the drawback is that only 4 characters
are possible, and sometimes real DNA data has
gaps and ambiguous characters - Other search engines treat these ambiguities as
separate letters, like R or N (N means it could
be any base) - SSAHA reverts all ambiguities to A, since AAAAA
is the most common word in the genome, and the
search can easily recognize it as uninformative
Contact Information
Nick Altemose, Student, Duke University Trinity
College of Arts and Sciences email
nick.altemose_at_duke.edu Owen Astrachan,
Professor, Duke University Department of Computer
Science email ola_at_cs.duke.edu Kelvin
Gu, Student, Duke University Trinity College of
Arts and Sciences email
kelvin.gu_at_duke.edu Tiffany Lin, Student, Duke
University Trinity College of Arts and Sciences
email tiffany.lin_at_duke.edu Kevin Tao,
Student, Duke University Pratt School of
Engineering email kevin.tao_at_duke.edu
Works Referenced
Genome Size Data from Department of Energy Website http//www.ornl.gov/sci/techresources/Human_Genome/faq/faqs1.shtml SSAHA data from Genome Research 2001 volume 11, pages 1725-1729 SSAHA A Fast Search Method for Large DNA Databases