Computational Analysis of Genome Sequences - PowerPoint PPT Presentation

Provided by: isrecI

1
Computational Analysis of Genome Sequences
  • Methods for analyzing compositional properties
  • Introduction to Interspersed Repeats
  • SINEs
  • LINEs
  • Retrovirus-like elements
  • DNA transposons
  • Computational Analysis of Repeats

2
Computational Analysis of Genome Sequences
  • Motivation, teaching objectives
  • Presentation of a practical application of Perl
    programming
  • Advertisement of the do-it-yourself approach:
    you don't always need professionally engineered
    software to make discoveries
  • Introduction to an important topic in genomics:
    repetitive elements
  • Not really a bioinformatics topic
  • However, research in this area almost exclusively
    relies on computational analysis of genome
    sequences
  • It is important to know what's in the human
    genome
  • It's also important to know how genomes evolve
  • Repetitive elements constitute a 3rd class of
    evolutionary entities, in addition to cells and
    viruses
  • The biology of repetitive elements is fascinating

3
Computational Analysis of Compositional Properties
Sliding window techniques: The complete sequence is
analyzed in overlapping windows of fixed size,
e.g. size 100 bp, overlap 90 bp. For each window,
compute some index, e.g. the fraction of G+C, or the
ratio of observed over expected number of CpG
dinucleotides. Plot the G+C index or CpGobs/CpGexp
against the center position of the window.
Cumulative plots: The complete sequence is analyzed
in a series of subsequences all starting at the
beginning and ending at regularly spaced endpoints
within the sequence, e.g. 1-100, 1-200, ..., 1-1000000.
For each subsequence, compute some index, e.g. the
difference G-C over the total number of G+C in
the sequence (GC skew). Plot the GC skew index
against the end position (or genome) of the
subsequence.
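The sliding window technique can be sketched in Perl roughly as follows. This is an illustration rather than code from the slides: the subroutine name sliding_window, the 0-based center coordinate, and the expected CpG count taken as (number of C) x (number of G) / window length are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [center, gc_fraction, cpg_obs_over_exp]
# triple per window. $size is the window size and $shift the distance
# between window starts (overlap = $size - $shift), both in bp.
sub sliding_window {
    my ($seq, $size, $shift) = @_;
    $seq = uc $seq;
    my @rows;
    for (my $start = 0; $start + $size <= length($seq); $start += $shift) {
        my $win = substr($seq, $start, $size);
        my $c   = ($win =~ tr/C//);          # count C's (tr leaves $win unchanged)
        my $g   = ($win =~ tr/G//);          # count G's
        my $cpg = () = $win =~ /CG/g;        # count CpG dinucleotides
        my $gc_frac = ($c + $g) / $size;
        # Assumed expectation: C count * G count / window length
        my $ratio = ($c && $g) ? $cpg * $size / ($c * $g) : 0;
        push @rows, [$start + $size / 2, $gc_frac, $ratio];
    }
    return @rows;
}

# Toy sequence, window 10 bp, shift 5 bp (i.e. overlap 5 bp)
for my $row (sliding_window("ACGTCGCGATATATCGGCGC", 10, 5)) {
    printf "%6.1f %6.4f %6.4f\n", @$row;
}
```

For a real genome one would read the sequence from a FASTA file and use the window parameters from the slide (size 100, shift 10); the three columns can then be plotted against the first one.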
4
Example: GC skew in bacterial genomes
From Grigoriev 1998. Nucl. Acids Res. 26:2286.
5
Explanation for GC skew in bacterial genomes
Spontaneous deamination of cytosine on the
single-stranded template → base-pairing with A
→ insertion of A instead of G.
6
Perl script for computing GC-skew
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $step = 1000;
    my $k = 0;   # number of bases scanned
    my $j = 0;   # position within current interval
    my $c = 0;   # number of C's found so far
    my $g = 0;   # number of G's found so far
    my $x;       # cumulative difference G - C

    # Note: lines are processed one by one
    while (<STDIN>) {
        if (not /^>/) {                  # skip FASTA header lines
            chomp;                       # remove \n at end of line
            s/[^ACGT]//g;                # remove non-base characters
            my @seq = split //;          # convert $_ into base array
            foreach my $nucleo (@seq) {
                if ($nucleo eq "C") { $c++ }
                if ($nucleo eq "G") { $g++ }
                $k++;                    # increment sequence counter
                $j++;                    # increment window size counter
                if ($j >= $step) {       # after reading $step bases, compute GC skew
                    $x = $g - $c;
                    printf "%8i %6.4f\n", $k, $x;   # print current results
                    $j = 0;              # reset window counter
                }
            }
        }
    }
    $x = $g - $c;
    printf "%8i %6.4f\n", $k, $x;        # print last value
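The script on this slide prints the raw cumulative difference G - C, whereas slide 3 describes GC skew as that difference divided by the total number of G+C. A minimal sketch of the normalized variant is given below; the subroutine name cumulative_gc_skew is hypothetical, and unlike the slide's script it reports only complete steps, not a final partial one.

```perl
use strict;
use warnings;

# Hypothetical helper: cumulative GC skew (G - C) / (G + C) for the
# prefixes 1..$step, 1..2*$step, and so on.
# Returns a list of [end_position, skew] pairs.
sub cumulative_gc_skew {
    my ($seq, $step) = @_;
    my ($g, $c, $k) = (0, 0, 0);
    my @points;
    foreach my $nucleo (split //, uc $seq) {
        $g++ if $nucleo eq 'G';
        $c++ if $nucleo eq 'C';
        $k++;
        if ($k % $step == 0) {
            my $total = $g + $c;
            # Skew is reported as 0 while no G or C has been seen yet
            push @points, [$k, $total ? ($g - $c) / $total : 0];
        }
    }
    return @points;
}

# Toy sequence, one point every 4 bases
printf "%8d %7.4f\n", @$_ for cumulative_gc_skew("GGGCAGGCCCTTCCCC", 4);
```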
7
Interspersed Repeats
Abundant: 50% of human DNA consists of interspersed repeats
Length: 100-8000 bp
Similarity: 60-100% identity
Selfish DNA: designed to increase their own copy number within the genome
Two types:
DNA transposons: cut&paste mechanism.
Retrotransposons: replicate through RNA intermediates like retroviruses.
Autonomous elements contain genes for enzymes necessary for
transposition (e.g. reverse transcriptase)
8
Major Classes of Human Interspersed Repeats
9
Major characteristics of SINEs (Short
Interspersed Elements)
  • Length: 80-300 bp
  • Non-autonomous elements: no protein-coding genes,
    presumably rely on reverse transcriptase from
    LINE elements
  • Transcribed from internal (downstream) POL III
    promoters
  • Poly-A tails for priming reverse transcription
  • Evolutionary origin: tRNA or other small RNA
    genes.
  • Prototype: Human Alu elements
  • Origin: 7SL RNA gene
  • Contains internal duplication
  • Name: Based on two internal Alu restriction sites
    (Alu digest of complete human DNA produces a
    characteristic band)
  • Several subfamilies

10
Major characteristics of LINEs (Long Interspersed
Elements)
  • Length: 6000-8000 bp
  • Autonomous elements: encode several proteins,
    including reverse transcriptase
  • Presumably transcribed from internal (downstream)
    POL II promoter
  • Poly-A tails for priming reverse transcription
  • Evolutionary origin: ancient, perhaps
    monophyletic
  • Other characteristics: Many 5'-truncated copies,
    presumably resulting from premature termination
    of reverse transcription.

11
Major characteristics of retrovirus-like elements
(RLE)
  • Length: 1500-10000 bp
  • Flanked by long terminal repeats (LTRs)
    containing promoter signals and poly-adenylation
    sites.
  • Autonomous elements: encode several proteins,
    including reverse transcriptase
  • Mammalian MaLR is a non-autonomous RLE lacking a
    reverse transcriptase gene
  • Transcribed from POL II promoters included in
    LTR
  • Elaborate mechanism for restoring LTRs before
    insertion
  • Evolutionary origin: ancient, same as
    retroviruses
  • Other characteristics: frequent insertion of
    single LTRs

12
Major characteristics of DNA transposons
  • Length: 80-3000 bp
  • Flanked by short inverted repeats.
  • Autonomous elements: encode a protein named
    transposase or integrase
  • Original copy is cut out, leaving a double-strand
    break
  • Increase of copy number through
    recombination-mediated double-strand break repair
  • Evolutionary origin: ancient, common to
    prokaryotes and eukaryotes.
  • Other characteristics: Frequent internal
    deletions
  • Spliceosomal introns and inteins (protein introns)
    may have originated from self-splicing DNA
    transposons

13
Cut&Paste Transposons: Increasing Copy Number
through Repair by Sister Chromatids
This figure explains the events happening after a
double-strand break. The bottom strand represents
the sister chromatid used for repair, not the
newly inserted transposon (P-element in this case).
14
Major Classes of Human Interspersed Repeats
15
Evolution of Interspersed Repeats
  • Repeat families rapidly expand during short
    periods (a few million years)
  • Repeats classified as young and ancient according
    to time of major burst
  • After insertion, most copies accumulate mutations
    rendering them incapable of further propagation
  • Elements with defunct protein-coding genes may
    still be able to transpose
  • Host genomes presumably have evolved
    counter-mechanisms to contain propagation of
    repetitive elements
  • Repeat propagation may lead to rapid increase in
    genome size (doubling of maize genome within less
    than 3 million years)
  • Repeat insertion is a major type of mutation in many
    organisms (Drosophila > 50%, mouse 2.5%, human
    0.07%)
  • Insertion events can be dated by cross-species
    comparison
  • Ancient elements occur at homologous genome
    positions in related species.
  • Average sequence divergence between repeat family
    members indicates age.
  • Young repeats are species-specific.
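The statement that average sequence divergence indicates age can be illustrated with a small Perl sketch. It assumes that each repeat copy has already been aligned to its family consensus (equal-length strings, '-' for gaps). The subroutine names divergence and average_divergence are hypothetical. Turning a divergence value into an age in years would additionally need a substitution rate, which the slides do not give.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of mismatching columns between two
# pre-aligned sequences; columns with a gap in either one are skipped.
sub divergence {
    my ($copy, $consensus) = @_;
    die "sequences must have equal length\n"
        unless length($copy) == length($consensus);
    my ($compared, $mismatches) = (0, 0);
    for my $i (0 .. length($copy) - 1) {
        my $x = uc substr($copy, $i, 1);
        my $y = uc substr($consensus, $i, 1);
        next if $x eq '-' or $y eq '-';
        $compared++;
        $mismatches++ if $x ne $y;
    }
    return $compared ? $mismatches / $compared : 0;
}

# Mean divergence of several aligned copies from one consensus
sub average_divergence {
    my ($consensus, @copies) = @_;
    return 0 unless @copies;
    my $sum = 0;
    $sum += divergence($_, $consensus) for @copies;
    return $sum / @copies;
}

printf "%.3f\n", average_divergence("ACGTTCGTAA", "ACGTACGTAC", "ACGTTCGTAA");
```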

16
Distribution of Interspersed Repeats in the Human
Genome
Repeats almost exclusively found in non-coding regions
Repeats frequently occur in 3' UTRs of mRNAs (spotted cDNA arrays
contain many Alu elements → technical problem for hybridization)
Repeats tend to be clustered in presumably function-less genomic regions
Young elements often inserted into old elements → nested repeat structures
Alu elements more frequent in GC-rich regions
L1 elements more frequent in AT-rich regions
17
Most frequent interspersed Repeats in the human
Genome
18
Retro-pseudogenes
Intron-free copies of functional genes, usually not transcribed.
Origin: reverse transcription of mRNA followed by insertion into genome.
Generation of retro-pseudogenes presumably mediated by LINE-derived enzymes
Unlike SINEs and LINEs, retro-pseudogenes lack internal promoters
→ no epidemic amplification within genome
Human genome contains about 30000 retro-pseudogenes
Most retro-pseudogenes have accumulated mutations in the coding regions,
including mutations leading to frame-shifts or premature stop codons
Most retro-pseudogenes derived from housekeeping genes strongly expressed
in the germ-line, including many small RNA genes.
19
Alu repeats in human DNA visualized by a dot
matrix
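A dot matrix of the kind named in this slide title can be sketched in a few lines of Perl. The figure itself is not part of the transcript, so the word length, the '*' and '.' symbols, and the subroutine name dot_matrix are assumptions chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the dot matrix as a list of strings.
# Row i, column j holds '*' when the words of length $word starting at
# position i of $s and position j of $t are identical, '.' otherwise.
sub dot_matrix {
    my ($s, $t, $word) = @_;
    my @rows;
    for my $i (0 .. length($s) - $word) {
        my $line = '';
        for my $j (0 .. length($t) - $word) {
            $line .= (substr($s, $i, $word) eq substr($t, $j, $word)) ? '*' : '.';
        }
        push @rows, $line;
    }
    return @rows;
}

# Self-comparison of a toy sequence containing a repeated segment
print "$_\n" for dot_matrix("ACGTTACGT", "ACGTTACGT", 3);
```

In a self-comparison the main diagonal is always filled; a repeated segment shows up as an additional diagonal run parallel to it.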
20
Software for detecting repeats: RepeatMasker
Purpose:
  • Identification and classification of repeats
  • Masking of repeats for subsequent analysis
Method:
  • Local alignment search against library of consensus
    sequences of repeat families.
  • Similarity search program: cross_match (fast
    Smith-Waterman implementation) or blastn
  • Iterative search for detection of nested repeats
    (removal of complete elements)
  • Ancient and young repeats searched with different
    scoring systems
  • Simultaneously searches for RNA genes (many small
    RNA genes repeated in the genome)
  • Also identifies simple repeats (tandem mono- and
    dinucleotide repeats).
Output:
  • Feature table-like annotation of repeats
  • Original sequence with repeats replaced by runs of N's
Notes:
  • In principle, a library of repeats is required for
    each species
  • Blastn with a human query containing an unmasked Alu
    element will return over 100000 matches to the
    human genome!
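The masking step (original sequence with repeats replaced by runs of N's) can be illustrated with the short Perl sketch below. It is not RepeatMasker's own code: the subroutine mask_repeats is hypothetical and assumes the repeat annotation is already available as 1-based, inclusive [start, end] coordinate pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each annotated repeat interval by a run
# of N's. Intervals are [start, end] pairs, assumed 1-based and inclusive.
sub mask_repeats {
    my ($seq, @intervals) = @_;
    for my $iv (@intervals) {
        my ($start, $end) = @$iv;
        my $len = $end - $start + 1;
        substr($seq, $start - 1, $len) = 'N' x $len;   # overwrite in the local copy
    }
    return $seq;
}

# Toy example: mask positions 3-6 and 11-12
print mask_repeats("ACGTACGTACGTACGT", [3, 6], [11, 12]), "\n";
```

A sequence masked this way can be passed to a similarity search without the repeat-derived matches mentioned in the notes.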