Title: Assembling and Annotating the Draft Human Genome
1Evolution and the Santa Cruz Genome Browser
Jim Kent and the Genome Bioinformatics Group
University of California Santa Cruz
Pennsylvania State University
2Typical Gene Level View
Sialic Acid Binding/Ig-like Lectin 7
3Typical Gene Level View
Sialic Acid Binding/Ig-like Lectin 7
4Known Gene Details Page
5Known Gene Details Page
6PDB Ribbon Diagram
4 clicks away by the wonder of the world wide web
7Hox A Cluster, Many Tracks
8Track Controls are Now Grouped
9Packed mode saves space, makes labels easier to
find.
10Squished mode is ideal for ESTs and mouse/human
homology
11Squished mode is ideal for ESTs and mouse/human
homology
ESTs hint at a smaller version of exon 2
12Publication Quality Output
13Comparative Genomics
14Chaining Alignments
- Chaining bridges the gulf between syntenic blocks and base-by-base alignments.
- Local alignments tend to break at transposon insertions, inversions, duplications, etc.
- Global alignments tend to force non-homologous bases to align.
- Chaining is a rigorous way of joining together local alignments into larger structures.
15Chains join together related local alignments
Protease Regulatory Subunit 3
16Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human
alignment correlated with sizes of transposon
relics. Affine gap scores model red/blue plots as
straight lines.
17Gaps are needed in Both Sequences in the General
Case of Pair-Wise Alignment
otherwise non-homologous bases can be forced to
pair
18 2-D histogram of observed gaps.
The horizontal axis is gaps in human, the
vertical axis is gaps in mouse. The logarithm of
counts of gaps in bins of 10 (left) and bins of
500 (right) are plotted as levels of gray with
black representing the highest counts. Note the
concentration of gaps along the axes,
particularly for shorter gaps.
19Before and After Chaining
20Chaining Algorithm
- Input: blocks of gapless alignments from blastz.
- Dynamic program based on the recurrence relationship
  score(B_i) = max_{j<i} ( score(B_j) + match(B_i) - gap(B_i, B_j) )
- Uses Miller's KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
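A minimal sketch of the chaining recurrence, for illustration only. It uses a plain O(N^2) scan over predecessors rather than the KD-tree search named on the slide, and the `Block` fields and the linear `gap_cost` are assumptions, not the scoring used in the talk. A block may also start a new chain, in which case its score is just its own match score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    q_start: int  # start in the first sequence (e.g. human)
    t_start: int  # start in the second sequence (e.g. mouse)
    length: int   # length of the gapless block
    match: int    # score of the block on its own


def gap_cost(a, b):
    """Hypothetical gap cost: the sum of the gap sizes in both sequences."""
    dq = b.q_start - (a.q_start + a.length)
    dt = b.t_start - (a.t_start + a.length)
    return dq + dt


def chain_blocks(blocks):
    """Return (best_score, chain) using the recurrence
    score(B_i) = max_{j<i} (score(B_j) + match(B_i) - gap(B_i, B_j))."""
    if not blocks:
        return 0, []
    blocks = sorted(blocks, key=lambda b: (b.q_start, b.t_start))
    score = [b.match for b in blocks]  # each block may start its own chain
    prev = [-1] * len(blocks)
    for i, bi in enumerate(blocks):
        for j in range(i):
            bj = blocks[j]
            # bj can precede bi only if it ends before bi in both sequences
            if (bj.q_start + bj.length <= bi.q_start
                    and bj.t_start + bj.length <= bi.t_start):
                cand = score[j] + bi.match - gap_cost(bj, bi)
                if cand > score[i]:
                    score[i] = cand
                    prev[i] = j
    best = max(range(len(blocks)), key=score.__getitem__)
    chain = []
    i = best
    while i != -1:  # walk the back-pointers to recover the chain
        chain.append(blocks[i])
        i = prev[i]
    chain.reverse()
    return score[best], chain


blocks = [Block(0, 0, 10, 10), Block(12, 13, 10, 10), Block(5, 100, 10, 10)]
best_score, chain = chain_blocks(blocks)
```

Here the first two blocks are collinear and join into one chain, while the third overlaps them in the first sequence and is left out.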
21Netting Alignments
- Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions.
- Net finds the best mouse match for each human region.
- Highest scoring chains are used first.
- Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
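The greedy idea behind netting can be sketched as follows. This is a simplified, flat illustration under stated assumptions: each chain is a hypothetical `(name, score, blocks)` tuple whose blocks are non-overlapping half-open intervals on the human sequence, and the sketch records only which chain covers each human region, not the nesting levels of the real net.

```python
def net_chains(chains):
    """Assign each reference region to the highest-scoring chain covering it.

    chains: list of (name, score, blocks), blocks = [(start, end), ...].
    Returns a sorted list of non-overlapping (start, end, name) pieces.
    """
    covered = []  # sorted, non-overlapping (start, end, name)
    # Highest scoring chains are placed first.
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        pieces = []
        for start, end in blocks:
            pos = start
            for s, e, _ in covered:
                if e <= pos:
                    continue  # covered piece lies entirely before pos
                if s >= end:
                    break     # covered piece lies entirely after this block
                if s > pos:
                    pieces.append((pos, s, name))  # uncovered stretch: fill it
                pos = max(pos, e)
            if pos < end:
                pieces.append((pos, end, name))
        covered = sorted(covered + pieces)
    return covered


chains = [
    ("chainA", 900, [(0, 100), (200, 300)]),  # high score, with an internal gap
    ("chainB", 300, [(90, 210)]),             # lower score, spans the gap
]
net = net_chains(chains)
```

In this example the lower-scoring chain contributes only the 100 to 200 stretch that the higher-scoring chain leaves open.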
22Net Focuses on Ortholog
23Net highlights rearrangements
A large gap in the top level of the net is filled
by an inversion containing two genes. Numerous
smaller gaps are filled in by local duplications
and processed pseudo-genes.
24Useful in finding pseudogenes
Ensembl and Fgenesh automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein
must be interesting!
25Mouse/Human Rearrangement Statistics
Number of rearrangements of given type per
megabase.
26A Rearrangement Hot Spot
Rearrangements are not evenly distributed.
Roughly 5% of the genome is in hot spots of
rearrangements such as this one. This 350,000
base region is between two very long chains on
chromosome 7.
27year of the rat - 2008
Rat Genome
28Rat/Mouse/Human Genome-Wide Multiz Alignments
Available
Eye lens protein gamma crystallin a. Upstream
region (on right) is highly conserved but not a
CpG island. Alignments are interrupted by
numerous recent transposon insertions.
29Details page offers quick access to browsers on
corresponding regions of other genomes. It also
highlights exons in base-by-base alignments.
30Zoom to Base Level
Detail near translation start of tubulin 8
31Zoom to Base Level
Intron consensus sequence visible.
32Zoom to Base Level
Possible alt-splice not consensus and not
conserved.
33Tiling the genome in Microarrays: New genes on 21 and 22?
34Cross-hybridization at Work
Zoomed in on right side
35 200 Bases Upstream of Known Genes 5' Extended by RNA/EST clusters
>hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgc
ctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcgg
agcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagct
gcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgct
ggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcg
gctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgc
cctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggg
gtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggca
gaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcc
tgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggc
taaaagctacaagcgcctcagatataaaagactcctggacggattttcat
ccagcacagagcagctgaatccatatttggcagctagtggatgggataag
aggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccg
cggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggcc
ggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtc
tgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc
36Acknowledgements
NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers
in the US and worldwide. Baylor, Sanger, Wash U,
Whitehead, Stanford, JGI/DOE, Oklahoma U and the
international sequencing centers. UCSC, NCBI,
EBI, Ensembl, Genoscope, MGC, Intel, TIGR,
Jackson Labs, Affymetrix, SwissProt.
Webb Miller, Chuck Sugnet, Robert Baertsch, Scott
Schwartz, Fan Hsu, Terry Furey, Ross Hardison,
David Haussler, Richard Gibbs, Bob Waterston,
Eric Lander, Francis Collins, LaDeana Hillier,
Roderic Guigo, Michael Brent, Olivier Jaillon,
David Kulp, Victor Solovyev, Ewan Birney, James
Gilbert, Greg Schuler, Deanna Church, the Gene
Cats. Everyone else!
37THE END
38A Cautionary Note
- Infant digestive systems are very permeable and take up antibodies.
- 10% of infants are allergic to cow's milk based formula.
- These infants get soy/corn based formula.
- As we engineer plants, let's be careful what we put in infant formula.
39New Algorithms and Data
- Chaining and netting of mouse/human alignments precisely define orthology and quantify rearrangements.
- Rat genome is browsable and used in rat/mouse/human multiple alignments.
- Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.
40Ideal Gap Penalties
- Would allow gaps in both sequences at once.
- Would penalize long gaps less than affine gap scores.
- Still would be quick to compute.
- We use a piecewise linear function of the sum of gap sizes plus a substantial penalty for gaps that are in both sequences at once.
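A gap cost of this shape might look like the sketch below. The breakpoints in `SIZES` and `COSTS`, the `BOTH_PENALTY` value, and the affine parameters are all hypothetical numbers chosen for illustration, not the values used in the talk. The cost grows roughly with the logarithm of gap size, so long gaps cost far less than under an affine score.

```python
import bisect

# Hypothetical breakpoints (gap size -> cost); illustrative values only.
SIZES = [1, 10, 100, 1000, 10000]
COSTS = [400, 700, 1000, 1300, 1600]
BOTH_PENALTY = 500  # hypothetical extra cost when both sequences have a gap


def piecewise_cost(total):
    """Piecewise linear cost of a gap whose summed size is `total`."""
    if total <= 0:
        return 0
    if total <= SIZES[0]:
        return COSTS[0]
    if total >= SIZES[-1]:
        # Past the last breakpoint, continue with the slope of the last segment.
        slope = (COSTS[-1] - COSTS[-2]) / (SIZES[-1] - SIZES[-2])
        return COSTS[-1] + slope * (total - SIZES[-1])
    k = bisect.bisect_right(SIZES, total)  # SIZES[k-1] <= total < SIZES[k]
    x0, x1 = SIZES[k - 1], SIZES[k]
    y0, y1 = COSTS[k - 1], COSTS[k]
    return y0 + (y1 - y0) * (total - x0) / (x1 - x0)


def gap_cost(dq, dt):
    """Cost of a gap of dq bases in one sequence and dt in the other."""
    cost = piecewise_cost(dq + dt)
    if dq > 0 and dt > 0:
        cost += BOTH_PENALTY  # gaps in both sequences at once cost extra
    return cost


def affine_cost(n, gap_open=400, gap_extend=30):
    """Conventional affine gap score, for comparison (hypothetical values)."""
    return gap_open + gap_extend * n
```

With these illustrative numbers, `gap_cost(1000, 0)` is 1300 while `affine_cost(1000)` is 30400, and a double-sided gap such as `gap_cost(5, 5)` pays the extra penalty on top of the cost for a summed size of 10.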