Title: What data
1 What data?
- Today: Structural Genomics
- Friday: Database Resources
2 Structural Genomics
- Characterizing and locating the entire set of
genes in a genome.
3 Structural Genomic Strategies 1
- Ordered Approach
- Order clones along the genome, then sequence,
- not dependent on acceleration of sequencing
capacity,
- not dependent on advanced computer analysis,
- not dependent on as-yet-undeveloped sequencing
technologies,
- heavy up-front demand for human labor.
4 Sequence Ready Ordered Approach
Stitch, then sequence.
5 Structural Genomic Strategies 2
- Shotgun Approach
- Sequence first, then order,
- dependent on advances in computer analysis and
sequencing technologies,
- dependent on automated labor.
6 Genomes...T.A. Brown
- The big question is whether the 70 million
sequences could be assembled correctly...
- ...if the conventional shotgun approach is used,
which makes no reference to a genome map, then
the answer is certainly no.
Wrong, the answer is Yes, but the truth is,
with a map the sequence is better.
7 First the Genome(s): Celera
- 5 individuals
- two males / three females
- one African American
- one Asian Chinese
- one Hispanic Mexican
- two Caucasians
- take tissue and sperm samples, immortalize,
- extract DNA,
- shred and package in vectors.
8 Bacterial Artificial Chromosomes (BACs)
- F plasmid ancestry,
- maintain bacterial replication system and copy
number control system.
9 Science 291 (5507), 1304-1351
8 September 1999 - 25 June 2000
10 Single Strand PCR
dNTPs
5' - ATACATACTACTAACTAACTAA - 3'
3' - TATGTATGATGATTGATTGATT - 5'
Template
1 Primer
Taq Polymerase w/ Buffer
Cycles
Polymerization until Taq falls off, linear
amplification.
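To make "linear amplification" concrete, the sketch below compares idealized product counts for a single-primer reaction with those of ordinary two-primer PCR. It assumes perfect efficiency in every cycle, and the function names are illustrative only.

```python
def single_primer_copies(templates, cycles):
    """New strands made with one primer.

    The products cannot anneal to the primer, so only the original
    templates are copied in each cycle: growth is linear.
    """
    return templates * cycles


def two_primer_copies(templates, cycles):
    """Idealized two-primer PCR: every strand is copied each cycle,
    so the number of molecules doubles per cycle."""
    return templates * 2 ** cycles


for n in (1, 10, 30):
    print(n, single_primer_copies(1, n), two_primer_copies(1, n))
```

With one starting template, 30 cycles give 30 new strands with a single primer, against roughly a billion molecules with two primers.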
11 Cycle Sequencing: Chain Termination
ddNTPs
dNTPs
5' - ATACATAC - 3'
3' - TATGTATGATGATTGATTGATT - 5'
Template
1 Primer
Taq Polymerase w/ Buffer
Cycles
Polymerization until Taq hits ddNTP, linear
amplification.
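As a rough illustration of chain termination, the sketch below extends the primer along the template shown on the slide and lists every product that could end in a ddNTP. Sorting the products by length and reading each terminal base recovers the sequence downstream of the primer. This is a toy model with illustrative function names, not a description of the actual chemistry or base-calling software.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def termination_products(template_3to5, primer):
    """Return every chain-terminated product, shortest first.

    template_3to5 is written 3'->5' as on the slide, so the new strand
    (read 5'->3') is its base-by-base complement.
    """
    full_product = "".join(COMPLEMENT[b] for b in template_3to5)
    if not full_product.startswith(primer):
        raise ValueError("primer does not anneal at the 3' end of the template")
    # One product per position past the primer, each ending in a ddNTP.
    return [full_product[:i]
            for i in range(len(primer) + 1, len(full_product) + 1)]


def read_sequence(products):
    """Read the terminal (labelled) base of each product in order of length."""
    return "".join(p[-1] for p in sorted(products, key=len))


template = "TATGTATGATGATTGATTGATT"  # 3' -> 5', from the slide
primer = "ATACATAC"                  # 5' -> 3', from the slide
products = termination_products(template, primer)
called = read_sequence(products)
print(called)
```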
12 Fluorescent ddNTPs
14 ABI 3700
- Automated,
- Capillary Action,
- 15 minutes a day maintenance,
- 65 full-time staff.
15 Systems Biology
16 Mate Pairs
- BAC End Sequencing,
- sequence both ends of the BAC using vector
derived primers.
17 Science 291 (5507), 1304-1351
8 September 1999 - 25 June 2000
19 Whole Genome Assembly
20 WGA
- 1. Screener,
- 2. Overlapper,
- 3. Unitigger,
- 4. Scaffolder,
- 5. Repeat Resolver.
21 Screener
- ...finds and masks microsatellite repeats,
known repeat regions and ribosomal DNA,
- marks the rest for overlapping.
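The masking step can be pictured with a small sketch that replaces microsatellite runs with N so that later stages ignore them. The thresholds here (repeat units of 1 to 6 bp, at least five tandem copies) and the function name are assumptions chosen for illustration; the slide does not give the Screener's actual criteria, and known repeats and ribosomal DNA would need a reference library rather than a pattern.

```python
import re

# A unit of 1-6 bases followed by at least four more copies of itself.
MICROSATELLITE = re.compile(r"([ACGT]{1,6}?)\1{4,}")


def mask_microsatellites(read):
    """Replace each microsatellite run with N's of the same length."""
    return MICROSATELLITE.sub(lambda m: "N" * len(m.group(0)), read)


read = "GGTC" + "CA" * 8 + "TTGACC"   # hypothetical read with a (CA)8 run
masked = mask_microsatellites(read)
print(masked)
```

Masking with N keeps read coordinates unchanged, so the unmasked remainder can still be passed on for overlapping.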
22 Overlapper
- ...looks for end-to-end overlaps of at least 40
bp with no more than 6% differences in match.
What's the significance?
...a one in 10^17 event.
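The overlap criterion can be sketched as a suffix-to-prefix comparison between two reads. The version below is a deliberate simplification: it counts substitutions only (no insertions or deletions) and tries every overlap length by brute force, which would be far too slow for millions of reads. The function name and the example reads are hypothetical.

```python
def end_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of a that matches a prefix of b.

    An overlap counts if it is at least min_len bases long and the
    fraction of mismatching positions is at most max_diff.
    Returns 0 when no such overlap exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0


shared = "GT" * 25                 # hypothetical 50 bp shared region
read_a = "A" * 30 + shared
read_b = shared + "C" * 30
print(end_overlap(read_a, read_b))
```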
23 But(t)!
- ...the Screener doesn't include all of the low
frequency level repeats,
- ...so, a majority of the Overlapper outputs are
bogus.
24 What Now?
- ...some uniquely assembled contigs (unitigs) are
readily identifiable,
- all of the assembled sequences match over all of
the known sequence,
- and -
- ...are consistent with an 8x coverage,
- over-collapsed assemblies are identified and
broken down into unitigs when possible.
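One way to picture the coverage check is to compare the number of reads in a contig with the number expected at 8x coverage: a contig that collapses several copies of a repeat collects far more reads than its length predicts. The sketch below is a simplified stand-in, not the statistic the assembler used; the 500 bp read length, the 1.5x cutoff and the function names are all assumptions.

```python
def expected_reads(contig_len, coverage=8.0, read_len=500):
    """Reads expected in a contig of contig_len bp at the given coverage."""
    return coverage * contig_len / read_len


def looks_overcollapsed(contig_len, n_reads, coverage=8.0,
                        read_len=500, factor=1.5):
    """Flag a contig holding far more reads than its length predicts."""
    return n_reads > factor * expected_reads(contig_len, coverage, read_len)


print(expected_reads(10_000))
print(looks_overcollapsed(10_000, 170))
print(looks_overcollapsed(10_000, 330))
```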
25 Scaffolder
- ...contigs the contigs,
- uses mate-pair information.
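A minimal sketch of the idea, assuming each mate pair has already been reduced to a link "contig X lies to the left of contig Y": links backed by enough mate pairs are chained into ordered scaffolds. The two-pair support threshold, the data layout and the function name are illustrative assumptions; real scaffolding also tracks orientation and gap-size estimates.

```python
from collections import Counter


def scaffold(contigs, mate_links, min_support=2):
    """Chain contigs into scaffolds using mate-pair links.

    mate_links is a list of (left_contig, right_contig) tuples, one per
    mate pair whose two reads landed in different contigs.
    """
    support = Counter(mate_links)
    successor = {a: b for (a, b), n in support.items() if n >= min_support}
    has_predecessor = set(successor.values())
    scaffolds = []
    for start in contigs:
        if start in has_predecessor:
            continue  # this contig is reached from another scaffold start
        chain, current = [start], start
        # Follow links; the membership test guards against cyclic links.
        while current in successor and successor[current] not in chain:
            current = successor[current]
            chain.append(current)
        scaffolds.append(chain)
    return scaffolds


links = [("c1", "c2"), ("c1", "c2"),
         ("c2", "c3"), ("c2", "c3"), ("c2", "c3"),
         ("c3", "c4")]             # only one pair supports c3 -> c4
result = scaffold(["c1", "c2", "c3", "c4"], links)
print(result)
```

Requiring more than one supporting mate pair keeps a single chimeric clone from joining unrelated contigs.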
26 Repeat Resolver
- ...most of the remaining gaps were due to repeats.
...91% sequence, 9% gaps,
- Gaps,
- average 2.43 kb,
- over 50% < 500 bp,
- over 62% < 1 kb,
- no gap larger than 100 kb.
27 Scaffolds
28 Friday
- Databases,
- Database Mining.