Comparative Genomics I: Tools for comparative genomics - PowerPoint PPT Presentation

Slides: 47
Provided by: RossH154
Learn more at: http://www.bx.psu.edu


1
Comparative Genomics I: Tools for comparative
genomics
  • Penn State Univ.: Ross Hardison, Webb Miller,
    Francesca Chiaromonte, Laura Elnitski, James
    Taylor, David King, Hao Wang, Ying Zhang, Scott
    Schwartz, Shan Yang, Jia Li, Diana Kolbe
  • Univ. California at Santa Cruz: David Haussler,
    Jim Kent
  • Lawrence Livermore National Lab: Ivan Ovcharenko,
    Lisa Stubbs
  • Institute for Systems Biology: Arian Smit
  • Thanks to the Mouse, Rat, Chicken and other
    Genome Sequencing Consortia

2
DNA sequences of mammalian genomes
  • Human: 2.9 billion bp, finished
  • High quality, comprehensive sequence, very few
    gaps
  • Mouse, rat, dog, opossum, chicken, frog, etc.
  • About 40% of the human genome aligns with mouse
  • This is conserved, but not all is under
    selection.
  • About 5-6% of the human genome has been under
    purifying selection since the rodent-primate
    divergence
  • About 1.2% codes for protein
  • The 4 to 5% of the human genome that is under
    selection but does not code for protein should
    contain:
  • Regulatory sequences
  • Non-protein coding genes (UTRs and noncoding
    RNAs)
  • Other important sequences

3
Leveraging genome evolution to discover function
  • Overall goals and core concepts
  • All-vs-all whole-genome comparisons
  • No single pair of species is ideal for finding
    all functional sequences
  • Alignment scores
  • Aid in finding functional elements
  • Discriminate between functional classes
  • Example of experimental tests of the
    bioinformatic predictions

4
Ideal case for interpretation
[Figure: "Similarity" plotted against "Position along chromosome", with a reference level labeled "Neutral DNA"]
5
Complications to interpreting divergence
  • Sequence alignments are good but not perfect
  • Models for neutral DNA are not perfect
  • Classic: coding nucleotide positions that do not
    cause an amino acid change when altered
  • KS: synonymous substitution rate
  • Ancestral repeats
  • Now-defunct transposons that were active in the
    last common ancestor of the species being compared
  • Intronic and intergenic DNA
  • Rate of divergence of neutral DNA is NOT constant
  • Varies +/- 20% in human-mouse comparisons for 1 Mb
    windows across the genome
  • Need to incorporate rate variation into models
    for likelihood of selection
  • E.g. KA / KS ratio (nonsynonymous to synonymous
    substitution rate)
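
The KA / KS ratio mentioned above can be sketched as follows. This is a minimal illustration, not the method used in the talk: it assumes the numbers of synonymous and nonsynonymous sites and differences have already been counted by a codon-aware method (not implemented here), and it assumes a Jukes-Cantor correction for multiple hits. The function names and the example counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Convert an observed proportion of differing sites into
    estimated substitutions per site (Jukes-Cantor correction, assumed here)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor correction needs 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (KA, KS, KA/KS) from pre-computed site and difference counts."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("KS is zero, so the ratio is undefined")
    return ka, ks, ka / ks


# Hypothetical counts for one gene compared between two species.
ka, ks, ratio = ka_ks(nonsyn_diffs=12, nonsyn_sites=700,
                      syn_diffs=60, syn_sites=230)
print(f"KA={ka:.4f}  KS={ks:.4f}  KA/KS={ratio:.3f}")
```

A ratio well below 1 is the usual signature of purifying selection on the protein; because KS is measured in the same gene, the ratio partly absorbs regional variation in the neutral rate.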

6
Pairwise alignments: PipMaker and zPicture
  • http://www.bx.psu.edu/
  • http://www.dcode.org/

7
PipMaker: Server for aligning genomic DNA sequences
  • BlastZ
  • Align long sequences (> 1 megabase, Mb)
  • Handles multiple copies of related genes, other
    sequence rearrangements
  • Compute all local alignments between 2 sequences
    of 1 Mb each in about 1 min
  • Zheng Zhang, Webb Miller et al.
  • PipMaker
  • Show results in a compact display with flexible
    features
  • Scott Schwartz, Webb Miller, et al. (2000) Genome
    Res. 10:577-586.

8
4 ways to view an alignment of 2 sequences
9
Using PipMaker
  • Files needed
  • Sequence 1: reference sequence (e.g. human),
    FASTA format
  • Sequence 2: other sequence (e.g. mouse), FASTA
    format
  • RepeatMasker output for sequence 1
  • Exons file for sequence 1 (lists position,
    orientation and names of genes and individual
    exons)
  • Optional: underlay file to color pip by
    functional category
  • All must be text-only
  • URL is http://bio.cse.psu.edu, go to PipMaker
  • Enter files by browsing or cut-and-paste
  • Submit files, receive output by e-mail.
  • Should align 1 Mb x 1 Mb in less than a minute.

10
Example of using PipMaker: BTK, human vs. mouse
  • Defects in BTK lead to X-linked
    agammaglobulinemia; BTK may be needed for
    maturation of B cells
  • Sequences from R. Gibbs lab, each about 100 kb
  • Human: GenBank U78027
  • Mouse: GenBank U58105
  • Exons, underlay files from PipMaker examples
  • Repeats from RepeatMasker

11
Screen shot of PipMaker server
12
Percent identity plot (pip) from PipMaker: BTK
Exons are almost always conserved with no/few
gaps. Highly conserved non-coding sequences in
introns 4 and 5. The conserved sequences 5' and
3' to the 1st exon contribute to
lineage-specific expression of BTK (Oeltjen et
al. 1997).
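
What a pip summarizes can be sketched in a few lines. This is an illustration only, assuming that a pip draws one bar per gap-free aligned segment, positioned on the reference sequence at the height of its percent identity. The function name and the toy alignment are hypothetical, and this is not PipMaker's own code.

```python
def gap_free_segments(aligned_ref, aligned_other, gap="-"):
    """Split a pairwise alignment into gap-free segments.

    Returns a list of (ref_start, ref_end, percent_identity) tuples, where
    coordinates are 0-based, end-exclusive positions in the ungapped reference.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")

    segments = []
    ref_pos = 0          # position in the ungapped reference
    seg_start = None     # reference start of the current gap-free run
    matches = length = 0

    for a, b in zip(aligned_ref, aligned_other):
        if a == gap or b == gap:
            # A gap in either sequence closes the current segment.
            if length:
                segments.append((seg_start, seg_start + length,
                                 100.0 * matches / length))
            seg_start, matches, length = None, 0, 0
        else:
            if seg_start is None:
                seg_start = ref_pos
            length += 1
            matches += (a.upper() == b.upper())
        if a != gap:
            ref_pos += 1

    if length:
        segments.append((seg_start, seg_start + length,
                         100.0 * matches / length))
    return segments


# Toy alignment (hypothetical sequences, not BTK).
for start, end, pid in gap_free_segments("ACGTAC--GTTA", "ACCTA-TTGTAA"):
    print(f"reference {start}-{end}: {pid:.0f}% identity")
```

Plotting these tuples as horizontal bars (x = reference position, y = percent identity) gives the basic shape of a pip; exon and repeat annotations are then drawn along the same x axis.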
13
Dot-plot view from PipMaker
14
Automated extraction of sequences and annotations
for PipMaker and zPicture
  • Making exon file (gene and other functional
    annotation) and masking repeats
  • Essential to interpreting the alignments
  • It is a pain
  • Better idea (Ovcharenko): automate extraction of
    sequences, annotations, masking
  • PipMaker: PipHelper
  • zPicture: integrated into the interface

15
DCODE.org Comparative Genomics
16
zPicture interface
17
Automated extraction of sequence and annotation
18
Global aligners
  • Can get global alignments from Vista
  • Advantageous when the sequences being compared
    are not extensively rearranged and align over
    most of their lengths
  • E.g. comparing two alleles
  • Comparing closely related species

19
A molecular timescale for vertebrate evolution
20
MultiPip: Exons and potential regulatory
sequences are revealed progressively
21
Aligners for multiple sequences
  • Local alignments in multiple species
  • MultiPipMaker
  • Mulan (dcode.org)
  • Use pairwise blastZ alignments, joined into a
    multiple alignment by multiZ.
  • Sequence 1 is the reference.
  • Lose sequences in comparison species that do not
    align with the reference in pairwise alignments.
  • Mulan also runs TBA (threaded blockset aligner).
  • Retains all sequences, even those that do not
    align with the reference.
  • Can change reference sequence to get
    human-centric or mouse-centric views of
    multiple alignment
  • Global multiple alignments: MLAGAN, MAVID

22
Leveraging genome evolution to discover function
  • Overall goals and core concepts
  • All-vs-all whole-genome comparisons
  • No single pair of species is ideal for finding
    all functional sequences
  • Alignment scores
  • Aid in finding functional elements
  • Discriminate between functional classes
  • Example of experimental tests of the
    bioinformatic predictions

23
Whole genome alignments of mammals, birds, flies,
worms and yeast
24
Genome sequence assemblies and sources
Species   | Assembly | Genome size | Assembly depth | Source
Human     | hg17     | 2.851Gb     | finished       | International human genome sequencing consortium
Chimp     | panTro1  | ca. 2.8Gb   | 4x             | Chimpanzee sequencing consortium
Mouse     | mm5      | 2.6Gb       | 1.9Gb finished | Mouse genome sequencing consortium
Rat       | rn3      | 2.57Gb      | (not given)    | Baylor and collaborators
Dog       | canFam1  | 2.5Gb       | 7.6x           | Broad Institute and Agencourt Bioscience
Cow       | bosTau1  | ca. 3Gb     | 3x             | Baylor and collaborators
Opossum   | monDom1  | 3.5Gb       | 6.5x           | Broad Institute
Platypus  | ornAna0  | (not given) | (not given)    | Washington University Genome Seq Center
Chicken   | galGal2  | 1.2Gb       | 6.63x          | Washington University Genome Seq Center
Frog      | xenTro1  | ca. 1.3Gb   | 7.4x           | DoE Joint Genome Institute
Zebrafish | Zv4      | 1.56Gb      | 5.7x           | Zebrafish Sequencing Group at the Sanger Institute
Tetraodon | tetNig1  | 0.385Gb     | 7.9x           | Genoscope and Broad Institute
Fugu      | fr1      | 0.319Gb     | 5.7x           | DoE Joint Genome Institute and Singapore IMCB
25
Alignment of genomes
  • blastZ for pairwise alignments
  • multiZ for multiple alignment
  • Human, chimp, mouse, rat, chicken, dog
  • Also multiple fly, worm, yeast genomes
  • Organize local alignments: chains and nets
  • All against all comparisons
  • High sensitivity and specificity
  • Computer cluster at UC Santa Cruz
  • 1024 CPUs (Pentium III)
  • Job takes about half a day
  • Results available at:
  • UCSC Genome Browser: http://genome.ucsc.edu
  • Galaxy server: http://www.bx.psu.edu

People: Scott Schwartz, Webb Miller, Jim Kent, David Haussler
References: Schwartz et al., 2003, blastZ, Genome Research;
Blanchette et al., 2004, TBA and multiZ, Genome Research
26
Genome-wide local alignment chains
Human: 2.9 Gb assembly. Mask interspersed
repeats, break into 300 segments of 10 Mb.
[Figure labels: Human, Mouse]
Run blastZ in parallel for all human segments.
Collect all local alignments above threshold.
Organize local alignments into a set of chains
based on position in assembly and orientation.
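
The chaining step can be illustrated with a simplified sketch. It assumes that all local alignments are on the same strand and that the best chain is simply the highest-scoring series of blocks whose coordinates increase in both genomes. Production chaining programs also penalize the gaps between blocks and use faster data structures; both are omitted here. The function name, tuple layout and example blocks are hypothetical.

```python
def best_chain(blocks):
    """Find the highest-scoring collinear chain of local alignments.

    blocks: list of (ref_start, ref_end, other_start, other_end, score),
            half-open coordinates, all in the same orientation.
    Returns (total_score, chain), where chain lists the blocks in order.
    """
    if not blocks:
        return 0, []

    order = sorted(blocks, key=lambda b: (b[0], b[2]))
    n = len(order)
    best = [b[4] for b in order]   # best chain score ending at each block
    prev = [None] * n              # back-pointer to the preceding block

    for j in range(n):
        for i in range(j):
            # Block i may precede block j only if it ends before j starts
            # in BOTH genomes (no gap penalty in this simplified version).
            if order[i][1] <= order[j][0] and order[i][3] <= order[j][2]:
                candidate = best[i] + order[j][4]
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i

    end = max(range(n), key=best.__getitem__)
    total = best[end]
    chain = []
    while end is not None:
        chain.append(order[end])
        end = prev[end]
    chain.reverse()
    return total, chain


# Hypothetical local alignments: three collinear blocks and one that maps
# to a distant position in the second genome.
a = (0, 100, 0, 100, 50)
b = (150, 250, 160, 260, 40)
c = (120, 220, 900, 1000, 45)
d = (300, 400, 310, 410, 30)
blocks = [a, b, c, d]
total, chain = best_chain(blocks)
print(total, chain)
```

In this example block c scores well on its own but cannot be placed in order with b and d, so it is left out of the chain.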
27
Comparative genomics to find functional sequences
[Figure: genome sizes in million base pairs (Mbp); values shown: 2,900; 2,400; 2,500; 1,200]
Papers in Nature from mouse and rat and chicken
genome consortia, 2002, 2004
28
Variation in rates by lineage
Human
  • Substitutions per site in likely neutral DNA
  • Ancestral repeats
  • About 3-fold higher in combined branches to
    rodents than in human
  • Fast rate in rodent, mouse and rat branches
  • Rate for rat branch is slightly faster than for
    mouse
  • Similar differences are seen for microinsertions
    and microdeletions

Rat Genome Sequencing Project Consortium, 2004,
Nature
29
Regional variation in divergence rates
30
Co-variation in substitution, deletion,
insertion, and recombination on Chr 22
31
Implications of co-variation in divergence
  • Large regions (megabase sized) are changing
    relatively fast or slow for (almost) all types of
    divergence
  • Neutral substitution, insertion, deletion,
    recombination
  • This is a consistent property of each region of
    genomic DNA
  • Similar patterns in mouse and human for
    lineage-specific interspersed repeats
  • Similarly fast or slow rates for orthologous
    regions in human-chimp and mouse-rat comparisons
  • An aligned segment with a given similarity score
    in a fast-changing region is MORE significant
    than an aligned segment with the same similarity
    score in a slow-changing region.
  • Must take the differential rate into account in
    searching for functional DNA (DNA under
    selection).

32
p-values reflecting different divergence rates
reveal more significant alignments
Jia Li and Webb Miller: HMMs to model local rate
variation, then use a Markov model to assign a
p-value given that local rate.
33
Use measures of alignment quality to discriminate
functional from nonfunctional DNA
  • Compute a conservation score adjusted for the
    local neutral rate
  • Score S for a 50 bp region R is the normalized
    fraction of aligned bases that are identical
  • Subtract mean for aligned ancestral repeats in
    the surrounding region
  • Divide by standard deviation

p = fraction of aligned sites in R that
are identical between human and mouse
m = average fraction of aligned sites that are
identical in aligned ancestral repeats in the
surrounding region
n = number of aligned sites in R
Waterston et al., Nature
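
The score described on this slide can be sketched as below. The formula itself did not survive in the transcript, so this assumes the standard deviation is the binomial one, giving S = (p - m) / sqrt(m(1 - m)/n) with p, m and n as defined above. The function name and the example values are hypothetical.

```python
import math


def conservation_score(p, m, n):
    """Conservation score adjusted for the local neutral rate.

    p: fraction of aligned sites in the window that are identical
    m: mean identity in aligned ancestral repeats in the surrounding region
    n: number of aligned sites in the window
    Assumes a binomial standard deviation, sqrt(m * (1 - m) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < m < 1.0:
        raise ValueError("m must be strictly between 0 and 1")
    return (p - m) / math.sqrt(m * (1.0 - m) / n)


# The same 90% identity in a 50 bp window, in two hypothetical regions:
fast_region = conservation_score(p=0.90, m=0.60, n=50)  # neutral DNA diverges quickly
slow_region = conservation_score(p=0.90, m=0.75, n=50)  # neutral DNA diverges slowly
print(f"fast-changing region: S = {fast_region:.2f}")
print(f"slow-changing region: S = {slow_region:.2f}")
```

The same raw identity gives a higher score where the surrounding neutral DNA has diverged more, which is the point made about fast-changing and slow-changing regions.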
34
Decomposition of conservation score into neutral
and likely-selected portions
[Figure labels: Neutral DNA (ARs); All DNA; Likely selected DNA: at least 5-6%]
S is the conservation score adjusted for
variation in the local substitution rate. The
frequency of the S score for all 50bp windows in
the human genome is shown.
From the distribution of S scores in ancestral
repeats (mostly neutral DNA), can compute a
probability that a given alignment could result
from locally adjusted neutral rate.
Waterston et al., Nature
35
Coverage of human by alignments with other
vertebrates ranges from 1% to 91%
[Figure residue: label "Human"; axis "Millions of years"; values shown: 5.4, 91, 92, 173, 220, 310, 360, 450]
36
Distinctive divergence rates for different types
of functional DNA sequences
37
Leveraging genome evolution to discover function
  • Overall goals and core concepts
  • All-vs-all whole-genome comparisons
  • No single pair of species is ideal for finding
    all functional sequences
  • Alignment scores
  • Aid in finding functional elements
  • Discriminate between functional classes
  • Example of experimental tests of the
    bioinformatic predictions

38
Score multi-species alignments for features
associated with function
  • Multiple alignment scores
  • Binomial, parsimony (Margulies et al., 2003,
    Genome Research)
  • PhastCons
  • Siepel et al. 2005, Genome Research
  • Phylogenetic Hidden Markov Model
  • Posterior probability that a site is among the
    10% most highly conserved sites
  • Allows for variation in rates and autocorrelation
    in rates
  • Factor binding sites conserved in human, mouse
    and rat
  • Tffind (from M. Weirauch, Schwartz et al., 2003)
  • Score alignments by frequency of matches to
    patterns distinctive for CRMs
  • Regulatory potential (Elnitski et al., 2003;
    Kolbe et al., 2004)

39
Score alignments for level of conservation
  • phastCons (Siepel and Haussler, 2003)
  • Phylogenetic Hidden Markov Model
  • Posterior probability that a site is among the
    10% most highly conserved sites
  • Allows for variation in rates and autocorrelation
    in rates

Alignment:
seq1 G T A C C T A C T A C G C A
seq2 G T G T C G - - A G C C C A
seq3 G T G A C T - - A C C G C G
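
The idea of a per-site posterior probability from a hidden Markov model can be illustrated with a toy version. This is not phastCons: there is no phylogeny or substitution model, only two hidden states (conserved, neutral) and one observation per column (all sequences identical or not). The transition and emission probabilities are invented for illustration, and the function names are hypothetical.

```python
def identical_columns(seqs, gap="-"):
    """True for each alignment column in which all sequences share one base."""
    return [len(set(col)) == 1 and col[0] != gap for col in zip(*seqs)]


def conserved_posteriors(is_identical, stay=0.9, emit_identical=(0.9, 0.3)):
    """Posterior probability of the 'conserved' state for each column.

    States: 0 = conserved, 1 = neutral.
    stay: probability of remaining in the same state (models autocorrelation).
    emit_identical: P(column identical | state) for (conserved, neutral);
                    both values are illustrative assumptions.
    """
    n = len(is_identical)
    if n == 0:
        return []
    trans = [[stay, 1.0 - stay], [1.0 - stay, stay]]

    def emit(state, obs):
        p = emit_identical[state]
        return p if obs else 1.0 - p

    # Forward pass, rescaled at each step to avoid underflow.
    prev = [0.5 * emit(s, is_identical[0]) for s in (0, 1)]
    total = sum(prev)
    prev = [x / total for x in prev]
    fwd = [prev]
    for t in range(1, n):
        cur = [emit(s, is_identical[t]) *
               sum(prev[r] * trans[r][s] for r in (0, 1)) for s in (0, 1)]
        total = sum(cur)
        prev = [x / total for x in cur]
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = [sum(trans[s][r] * emit(r, is_identical[t + 1]) * bwd[t + 1][r]
                   for r in (0, 1)) for s in (0, 1)]
        total = sum(nxt)
        bwd[t] = [x / total for x in nxt]

    # Combine and normalize per column.
    post = []
    for f, b in zip(fwd, bwd):
        conserved, neutral = f[0] * b[0], f[1] * b[1]
        post.append(conserved / (conserved + neutral))
    return post


# The three-sequence alignment from the slide, with spaces removed.
seqs = ["GTACCTACTACGCA", "GTGTCG--AGCCCA", "GTGACT--ACCGCG"]
cols = identical_columns(seqs)
post = conserved_posteriors(cols)
for i, (c, p) in enumerate(zip(cols, post)):
    print(i, "identical" if c else "differs  ", f"{p:.2f}")
```

Because neighboring columns share information through the transition probabilities, an isolated identical column in a variable stretch scores lower than one inside a run of identical columns; this is the autocorrelation mentioned in the bullets.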
40
phastCons on Conservation track at Genome Browser
41
Ultraconserved elements
42
Deletion of locus control region associated with
beta-thalassemia
43
Galaxy metaserver for integrative analysis of
genomic data
  • Use servers at primary data repositories (e.g.
    UCSC Table Browser) to gather initial data
  • Results stored and analyzed at Galaxy
  • Operations
  • Union, intersection, subtraction
  • Clustering, proximity
  • Bioinformatic tools
  • Retrieve alignments
  • KA/KS, PHYLIP programs for molecular evolution
  • EMBOSS tools for sequence analysis
  • http://www.bx.psu.edu
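
The interval operations listed above (union, intersection, subtraction) can be sketched on plain lists of half-open (start, end) intervals from one chromosome. This only illustrates the operations and is not Galaxy's implementation. The function names and the example coordinates are hypothetical.

```python
def merge(intervals):
    """Sort intervals and merge any that overlap or abut."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def union(a, b):
    """All positions covered by a or b."""
    return merge(list(a) + list(b))


def intersect(a, b):
    """Positions covered by both a and b."""
    a, b = merge(a), merge(b)
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a, b):
    """Positions covered by a but not by b."""
    b = merge(b)
    out = []
    for start, end in merge(a):
        cur = start
        for b_start, b_end in b:
            if b_end <= cur:
                continue
            if b_start >= end:
                break
            if b_start > cur:
                out.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


# Hypothetical coordinates: highly conserved regions and annotated exons.
conserved = [(100, 200), (300, 400), (500, 650)]
exons = [(150, 250), (380, 420), (600, 620)]
print("union:              ", union(conserved, exons))
print("conserved coding:   ", intersect(conserved, exons))
print("conserved noncoding:", subtract(conserved, exons))
```

Subtracting exons from conserved regions is one way to arrive at candidate noncoding functional elements, which can then be filtered further with other scores.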

44
Using Galaxy to find predicted CRMs
45
Conclusions
  • Particular types of functional DNA sequences are
    conserved over distinctive evolutionary
    distances.
  • Multispecies alignments can be used to predict
    whether a sequence is functional (signature of
    purifying selection).
  • Alignments can be used to predict certain
    functional regions, including some cis-regulatory
    elements.
  • The predictions of cis-regulatory elements for
    erythroid genes are validated at a good rate.
  • Databases such as the UCSC Table Browser and
    Galaxy provide access to these data.
  • http://genome.ucsc.edu/
  • http://www.bx.psu.edu/
  • Expect improvements at all steps.

46
Many thanks
PSU Database crew: Belinda Giardine, Cathy
Riemer, Yi Zhang, Anton Nekrutenko
Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong
Cheng, David King
RP scores and other bioinformatic
input: Francesca Chiaromonte, James Taylor, Shan
Yang, Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas: Webb
Miller, Jim Kent, David Haussler
Funding from NIDDK, NHGRI, Huck Institutes of
Life Sciences at PSU