Assembling and Annotating the Draft Human Genome - PowerPoint PPT Presentation

Slides: 41
Provided by: jimk88
Transcript and Presenter's Notes

1
Evolution and the Santa Cruz Genome Browser
Jim Kent and the Genome Bioinformatics Group
University of California Santa Cruz
Pennsylvania State University
2
Typical Gene Level View
Sialic Acid Binding/Ig-like Lectin 7
3
Typical Gene Level View
Sialic Acid Binding/Ig-like Lectin 7
4
Known Gene Details Page
5
Known Gene Details Page
6
PDB Ribbon Diagram
4 clicks away by the wonder of the world wide web
7
Hox A Cluster, Many Tracks
8
Track Controls are Now Grouped
9
Packed mode saves space, makes labels easier to
find.
10
Squished mode is ideal for ESTs and mouse/human
homology
11
Squished mode is ideal for ESTs and mouse/human
homology
ESTs hint at a smaller version of exon 2
12
Publication Quality Output
13
Comparative Genomics
14
Chaining Alignments
  • Chaining bridges the gulf between syntenic blocks
    and base-by-base alignments.
  • Local alignments tend to break at transposon
    insertions, inversions, duplications, etc.
  • Global alignments tend to force non-homologous
    bases to align.
  • Chaining is a rigorous way of joining together
    local alignments into larger structures.

15
Chains join together related local alignments
Protease Regulatory Subunit 3
16
Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human
alignment correlated with sizes of transposon
relics. Affine gap scores model red/blue plots as
straight lines.
17
Gaps are needed in Both Sequences in the General
Case of Pair-Wise Alignment
otherwise non-homologous bases can be forced to
pair
18
2-D histogram of observed gaps.
The horizontal axis is gaps in human, the
vertical axis is gaps in mouse. The logarithm of
counts of gaps in bins of 10 (left) and bins of
500 (right) are plotted as levels of gray with
black representing the highest counts. Note the
concentration of gaps along the axes,
particularly for shorter gaps.
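The binning described in this caption can be sketched in a few lines of Python. This is only an illustration: the function name `log_gap_histogram` is hypothetical, the input is assumed to be a list of (human gap, mouse gap) size pairs, and base-10 logarithms are an assumption, since the slide does not state the base.

```python
import math
from collections import Counter


def log_gap_histogram(gaps, bin_size):
    """Bin (human_gap, mouse_gap) pairs and return log10 of each bin's count.

    gaps     -- iterable of (human_gap_size, mouse_gap_size) pairs
    bin_size -- bin width in bases (the slide uses 10 and 500)
    """
    # Integer division maps each gap size to its bin index on that axis.
    counts = Counter((h // bin_size, m // bin_size) for h, m in gaps)
    # Only occupied bins are returned, so every count is at least 1.
    return {cell: math.log10(n) for cell, n in counts.items()}


# Toy data: most gaps are in one sequence only, so they sit on an axis.
toy_gaps = [(3, 0), (7, 0), (0, 4), (12, 0), (25, 30)]
hist = log_gap_histogram(toy_gaps, bin_size=10)
```

Bins whose index is 0 on one axis hold gaps that are (nearly) confined to a single sequence, which is where the caption notes the concentration.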
19
Before and After Chaining
20
Chaining Algorithm
  • Input - blocks of gapless alignments from blastz
  • Dynamic program based on the recurrence
    relationship $\mathrm{score}(B_i) = \max_{j<i}\,[\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)]$
  • Uses Miller's KD-tree algorithm to limit which
    parts of the dynamic programming graph are traversed.
    Timing is O(N log N), where N is the number of blocks
    (which is in the hundreds of thousands)

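To make the recurrence concrete, here is a minimal Python sketch of chaining by dynamic programming. It is not the implementation described on the slide: it checks every earlier block, so it runs in O(N^2) rather than using the KD-tree speedup, and the `Block` type, the `gap_cost` function and its linear cost are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A gapless alignment block; intervals are half-open, t = target, q = query."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    match: float  # score of the block on its own


def gap_cost(prev, cur):
    """Hypothetical gap cost: one tenth of the total distance between blocks."""
    dt = cur.t_start - prev.t_end
    dq = cur.q_start - prev.q_end
    return (dt + dq) / 10


def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain of blocks."""
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not blocks:
        return 0.0, []
    score = [b.match for b in blocks]  # a chain may start at any block
    back = [None] * len(blocks)        # predecessor index for traceback
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before cur starts in both sequences.
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                cand = score[j] + cur.match - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    end = max(range(len(blocks)), key=score.__getitem__)
    chain, k = [], end
    while k is not None:
        chain.append(blocks[k])
        k = back[k]
    chain.reverse()
    return score[end], chain


a = Block(0, 10, 0, 10, 10.0)
b = Block(20, 30, 20, 30, 10.0)
c = Block(12, 18, 100, 110, 5.0)  # far off the diagonal in the query
top_score, top_chain = best_chain([b, c, a])
```

In this toy input, blocks `a` and `b` are joined, while `c` is left out because it cannot precede `b` in the query and joining it to `a` costs more than it gains.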
21
Netting Alignments
  • Commonly multiple mouse alignments can be found
    for a particular human region, particularly for
    coding regions.
  • Net finds the best mouse match for each human
    region.
  • Highest scoring chains are used first.
  • Lower scoring chains fill in gaps within chains,
    inducing a natural hierarchy.
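The greedy idea in these bullets can be sketched as follows. This is a simplified illustration, not the actual netting program: each chain is reduced to a name, a score and a list of half-open human (target) intervals, and the names `subtract` and `build_net` as well as the level rule are assumptions.

```python
def subtract(intervals, claimed):
    """Remove every claimed interval from each interval; all are half-open."""
    out = []
    for start, end in intervals:
        pieces = [(start, end)]
        for c_start, c_end in claimed:
            remaining = []
            for p_start, p_end in pieces:
                if c_end <= p_start or c_start >= p_end:
                    remaining.append((p_start, p_end))  # no overlap
                else:
                    if p_start < c_start:
                        remaining.append((p_start, c_start))
                    if c_end < p_end:
                        remaining.append((c_end, p_end))
            pieces = remaining
        out.extend(pieces)
    return out


def build_net(chains):
    """chains: list of (name, score, blocks). Returns placed chains with levels."""
    placed, claimed = [], []
    # Highest-scoring chains are used first.
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        kept = subtract(sorted(blocks), claimed)
        if not kept:
            continue  # every base is already covered by a better chain
        lo, hi = kept[0][0], kept[-1][1]
        # A chain lying inside the span of a placed chain fills one of its
        # gaps, so it sits one level below the deepest enclosing chain.
        level = 1 + max(
            (p["level"] for p in placed
             if p["span"][0] <= lo and hi <= p["span"][1]),
            default=0,
        )
        placed.append({"name": name, "level": level,
                       "span": (lo, hi), "blocks": kept})
        claimed.extend(kept)
    return placed


net = build_net([
    ("ortholog", 1000, [(0, 100), (200, 300)]),
    ("inversion", 300, [(120, 180)]),
    ("paralog", 200, [(0, 100)]),
    ("dup", 100, [(90, 110)]),
])
```

Here the lower-scoring "paralog" chain is dropped because the ortholog already covers its bases, while "inversion" and the trimmed "dup" chain land in the ortholog's gap at level 2.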

22
Net Focuses on Ortholog
23
Net highlights rearrangements
A large gap in the top level of the net is filled
by an inversion containing two genes. Numerous
smaller gaps are filled in by local duplications
and processed pseudo-genes.
24
Useful in finding pseudogenes
Ensembl and Fgenesh automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein
must be interesting!
25
Mouse/Human Rearrangement Statistics
Number of rearrangements of given type per
megabase.
26
A Rearrangement Hot Spot
Rearrangements are not evenly distributed.
Roughly 5% of the genome is in hot spots of
rearrangements such as this one. This 350,000
base region is between two very long chains on
chromosome 7.
27
year of the rat - 2008
Rat Genome
28
Rat/Mouse/Human Genome-Wide Multiz Alignments
Available
Eye lens protein gamma crystallin a. Upstream
region (on right) is highly conserved but not a
CpG island. Alignments are interrupted by
numerous recent transposon insertions.
29
Details page offers quick access to browsers on
corresponding regions of other genomes. It also
highlights exons in base-by-base alignments.
30
Zoom to Base Level
Detail near translation start of tubulin 8
31
Zoom to Base Level
Intron consensus sequence visible.
32
Zoom to Base Level
Possible alt-splice not consensus and not
conserved.
33
Tiling the genome in Microarrays
New genes on 21 and 22?
34
Cross-hybridization at Work
Zoomed in on right side
35
200 Bases Upstream of Known Genes 5' Extended by
RNA/EST clusters
>hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgc
ctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcgg
agcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagct
gcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgct
ggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcg
gctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgc
cctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggg
gtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggca
gaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcc
tgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggc
taaaagctacaagcgcctcagatataaaagactcctggacggattttcat
ccagcacagagcagctgaatccatatttggcagctagtggatgggataag
aggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccg
cggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggcc
ggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtc
tgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc
36
Acknowledgements
  • Individuals
  • Institutions

NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers
in the US and worldwide. Baylor, Sanger, Wash U,
Whitehead, Stanford, JGI/ DOE, Oklahoma U and the
international sequencing centers. UCSC, NCBI,
EBI, Ensembl, Genoscope, MGC, Intel, TIGR,
Jackson Labs, Affymetrix, SwissProt.
Webb Miller, Chuck Sugnet, Robert Baertsch, Scott
Schwartz, Fan Hsu, Terry Furey, Ross Hardison,
David Haussler, Richard Gibbs, Bob Waterston,
Eric Lander, Francis Collins, LaDeana Hillier,
Roderic Guigo, Michael Brent, Olivier Jaillon,
David Kulp, Victor Solovyev, Ewan Birney, James
Gilbert, Greg Schuler, Deanna Church, the Gene
Cats. Everyone else!
37
THE END
38
A Cautionary Note
  • Infant digestive systems very permeable, uptake
    antibodies
  • 10% of infants are allergic to cow's milk based
    formula
  • These infants get soy/corn based formula
  • As we engineer plants, let's be careful what we
    put in infant formula

39
New Algorithms and Data
  • Chaining and netting of mouse/human
    alignments precisely define orthology and
    quantify rearrangements.
  • Rat genome is browsable and used in
    rat/mouse/human multiple alignments.
  • Cross-hybridization potential of Affymetrix-style
    microarrays calculated and displayed.

40
Ideal Gap Penalties
  • Would allow gaps in both sequences at once
  • Would penalize long gaps less than affine gap
    scores.
  • Still would be quick to compute.
  • We use a piecewise linear function of the sum of
    gap sizes plus a substantial penalty for gaps
    that are in both sequences at once.
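A gap cost of the kind described in the last bullet might look like the Python sketch below. All numbers are invented for illustration and are not the values used in the actual alignments. The breakpoint tables `SIZES` and `COSTS`, the constant `BOTH_PENALTY`, and the function names are assumptions, and an affine cost is included only for comparison.

```python
from bisect import bisect_right

# Hypothetical breakpoints: total gap size in bases and the cost at that size.
SIZES = [1, 10, 100, 1000, 10000]
COSTS = [350.0, 600.0, 900.0, 1500.0, 2500.0]
BOTH_PENALTY = 750.0  # hypothetical extra cost when both sequences have a gap


def piecewise_cost(total):
    """Piecewise-linear cost of a gap whose sizes sum to `total` bases."""
    if total <= 0:
        return 0.0
    if total <= SIZES[0]:
        return COSTS[0]
    # Find the segment containing `total`; past the last breakpoint,
    # extrapolate along the final segment.
    i = min(bisect_right(SIZES, total), len(SIZES) - 1)
    x0, x1 = SIZES[i - 1], SIZES[i]
    y0, y1 = COSTS[i - 1], COSTS[i]
    return y0 + (y1 - y0) * (total - x0) / (x1 - x0)


def gap_cost(d_target, d_query):
    """Cost of a gap of d_target bases in one sequence and d_query in the other."""
    cost = piecewise_cost(d_target + d_query)
    if d_target > 0 and d_query > 0:
        cost += BOTH_PENALTY  # simultaneous gaps are allowed but discouraged
    return cost


def affine_cost(length, gap_open=400.0, gap_extend=30.0):
    """Conventional affine gap cost, shown only for comparison."""
    return gap_open + gap_extend * length if length > 0 else 0.0


one_sided = gap_cost(55, 0)
two_sided = gap_cost(50, 5)
long_gap = gap_cost(1000, 0)
long_affine = affine_cost(1000)
```

With these made-up parameters, a 1000-base gap costs far less than under the affine score, and a gap split across both sequences costs more than a one-sided gap of the same total size. Evaluating the function is a table lookup plus one interpolation, so it stays quick to compute.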