Title: Computational Analysis of Genome Sequences
1. Computational Analysis of Genome Sequences
- Methods for analyzing compositional properties
- Introduction to Interspersed Repeats
- SINEs
- LINEs
- Retrovirus-like elements
- DNA transposons
- Computational Analysis of Repeats
2. Computational Analysis of Genome Sequences
- Motivation, teaching objectives
- Presentation of a practical application of Perl programming
- Advertisement of the do-it-yourself approach: you don't always need professionally engineered software to make discoveries
- Introduction to an important topic in genomics: repetitive elements
- Not really a bioinformatics topic
- However, research in this area relies almost exclusively on computational analysis of genome sequences
- It is important to know what's in the human genome
- It's also important to know how genomes evolve
- Repetitive elements constitute a third class of evolutionary entities, in addition to cells and viruses
- The biology of repetitive elements is fascinating
3. Computational Analysis of Compositional Properties
Sliding window techniques: The complete sequence is analyzed in overlapping windows of fixed size, e.g. size 100 bp, overlap 90 bp. For each window, compute some index, e.g. the fraction of GC, or the ratio of the observed over the expected number of CpG dinucleotides. Plot the GC index or CpGobs/CpGexp against the center position of the window.
Cumulative plots: The complete sequence is analyzed in a series of subsequences, all starting at the beginning and ending at regularly spaced endpoints within the sequence, e.g. 1-100, 1-200, ..., 1-1000000. For each subsequence, compute some index, e.g. the difference G - C divided by the total number of G and C in the subsequence (GC skew). Plot the GC skew index against the end position (genome position) of the subsequence.
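The sliding window technique can be sketched in Perl as follows. This is a minimal illustration, not part of the original slides: the subroutine names (gc_fraction, cpg_obs_exp, sliding_window) are made up for this example, and the expected number of CpG is assumed to be (number of C x number of G) / window length, which is one commonly used definition.

```perl
use strict;
use warnings;

# Fraction of G and C among the A/C/G/T bases of a sequence.
sub gc_fraction {
    my ($seq) = @_;
    $seq = uc($seq);
    my $gc  = ($seq =~ tr/GC//);      # tr with empty replacement only counts
    my $all = ($seq =~ tr/ACGT//);
    return $all ? $gc / $all : 0;
}

# Ratio of observed to expected CpG dinucleotides.
# Assumption: expected = (#C * #G) / sequence length.
sub cpg_obs_exp {
    my ($seq) = @_;
    $seq = uc($seq);
    my $c   = ($seq =~ tr/C//);
    my $g   = ($seq =~ tr/G//);
    my $cpg = () = $seq =~ /CG/g;     # number of CG dinucleotides
    return 0 unless $c && $g;
    return $cpg * length($seq) / ($c * $g);
}

# Slide a window over the sequence and return a list of
# [center position, GC fraction, CpG obs/exp] triples.
sub sliding_window {
    my ($seq, $size, $overlap) = @_;
    my $step = $size - $overlap;
    die "overlap must be smaller than window size\n" if $step <= 0;
    my @rows;
    for (my $start = 0; $start + $size <= length($seq); $start += $step) {
        my $win = substr($seq, $start, $size);
        push @rows, [ $start + $size / 2, gc_fraction($win), cpg_obs_exp($win) ];
    }
    return @rows;
}

# Usage: window size 100 bp, overlap 90 bp, on a toy 200-bp sequence.
my $demo = "ACGT" x 50;
printf "%8.1f %6.4f %6.4f\n", @$_ for sliding_window($demo, 100, 90);
```

Each output line gives the window center followed by the two indices, ready to be plotted against position.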
4. Example: GC skew in bacterial genomes
From Grigoriev 1998, Nucl. Acids Res. 26:2286.
5. Explanation for GC skew in bacterial genomes
Spontaneous deamination of cytosine on the single-stranded template -> base-pairing with A -> insertion of A instead of G.
6. Perl script for computing GC skew
Note: lines are processed one by one.

#!/usr/bin/perl
use strict;
use warnings;

my $step = 1000;
my $k = 0;   # number of bases scanned
my $j = 0;   # position within current interval
my $c = 0;   # number of C's found so far
my $g = 0;   # number of G's found so far

while (<STDIN>) {
    if (not /^>/) {                  # skip FASTA header lines
        chomp;                       # remove \n at end of line
        s/[^ACGT]//g;                # remove non-base characters (upper case only is kept)
        my @seq = split //;          # convert $_ into base array
        foreach my $nucleo (@seq) {
            if ($nucleo eq "C") { $c++ }
            if ($nucleo eq "G") { $g++ }
            $k++;                    # increment sequence counter
            $j++;                    # increment window size counter
            if ($j >= $step) {       # after reading $step bases, compute GC skew
                my $x = $g - $c;     # cumulative G - C (not normalized)
                printf "%8i %6.4f\n", $k, $x;   # print current results
                $j = 0;              # reset window counter
            }
        }
    }
}
my $x = $g - $c;
printf "%8i %6.4f\n", $k, $x;        # print last value
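The script on this slide reports the raw difference G - C, whereas the cumulative-plot definition given earlier divides by the total number of G and C. A minimal sketch of the normalized variant as a reusable subroutine is shown below; the name cumulative_gc_skew and the in-memory string interface are assumptions made for this example, not part of the original script.

```perl
use strict;
use warnings;

# Return a list of [end position, skew] pairs, where
# skew = (G - C) / (G + C) over the prefix 1..end position,
# sampled every $step bases, plus one final point for the last base.
sub cumulative_gc_skew {
    my ($seq, $step) = @_;
    my ($g, $c, $k) = (0, 0, 0);
    my @points;
    my $skew = sub { ($g + $c) ? ($g - $c) / ($g + $c) : 0 };
    for my $base (split //, uc($seq)) {
        next unless $base =~ /^[ACGT]$/;    # ignore non-base characters
        $g++ if $base eq 'G';
        $c++ if $base eq 'C';
        $k++;
        push @points, [ $k, $skew->() ] if $k % $step == 0;
    }
    # last value, if the sequence does not end on a step boundary
    push @points, [ $k, $skew->() ] if $k && $k % $step != 0;
    return @points;
}

# Usage on a toy sequence with a step of 4 bases.
printf "%8d %7.4f\n", @$_ for cumulative_gc_skew("GGGGCCCCGGAT", 4);
```

For a whole genome the sequence would be read from a FASTA file first, and a step such as 1000 would be more typical.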
7. Interspersed Repeats
- Abundant: 50% of human DNA consists of interspersed repeats
- Length: 100-8000 bp
- Similarity: 60-100% identity
- Selfish DNA: designed to increase their own copy number within the genome
- Two types:
  - DNA transposons: cut-and-paste mechanism
  - Retrotransposons: replicate through RNA intermediates, like retroviruses
- Autonomous elements contain genes for enzymes necessary for transposition (e.g. reverse transcriptase)
8. Major Classes of Human Interspersed Repeats
9. Major characteristics of SINEs (Short Interspersed Elements)
- Length: 80-300 bp
- Non-autonomous elements: no protein-coding genes, presumably rely on reverse transcriptase from LINE elements
- Transcribed from internal (downstream) Pol III promoters
- Poly-A tails for priming reverse transcription
- Evolutionary origin: tRNA or other small RNA genes
- Prototype: human Alu elements
  - Origin: 7SL RNA gene
  - Contains internal duplication
  - Name: based on two internal Alu restriction sites (Alu digest of complete human DNA produces a characteristic band)
  - Several subfamilies
10. Major characteristics of LINEs (Long Interspersed Elements)
- Length: 6000-8000 bp
- Autonomous elements: encode several proteins, including reverse transcriptase
- Presumably transcribed from an internal (downstream) Pol II promoter
- Poly-A tails for priming reverse transcription
- Evolutionary origin: ancient, perhaps monophyletic
- Other characteristics: many 5'-truncated copies, presumably resulting from premature termination of reverse transcription
11. Major characteristics of retrovirus-like elements (RLE)
- Length: 1500-10000 bp
- Flanked by long terminal repeats (LTRs) containing promoter signals and polyadenylation sites
- Autonomous elements encode several proteins, including reverse transcriptase
- Mammalian MaLR is a non-autonomous RLE lacking a reverse transcriptase gene
- Transcribed from Pol II promoters included in the LTR
- Elaborate mechanism for restoring LTRs before insertion
- Evolutionary origin: ancient, same as retroviruses
- Other characteristics: frequent insertion of single LTRs
12. Major characteristics of DNA transposons
- Length: 80-3000 bp
- Flanked by short inverted repeats
- Autonomous elements: encode a protein named transposase or integrase
- Original copy is cut out, leaving a double-strand break
- Increase of copy number through recombination-mediated double-strand break repair
- Evolutionary origin: ancient, common to prokaryotes and eukaryotes
- Other characteristics: frequent internal deletions
- Spliceosomal introns and inteins (protein introns) may have originated from self-splicing DNA transposons
13. Cut-and-Paste Transposons: Increasing Copy Number through Repair by Sister Chromatids
This figure explains the events happening after the double-strand break. The bottom strand represents the sister chromatid used for repair, not the newly inserted transposon (a P element in this case).
14. Major Classes of Human Interspersed Repeats
15. Evolution of Interspersed Repeats
- Repeat families rapidly expand during short periods (a few million years)
- Repeats are classified as young or ancient according to the time of their major burst
- After insertion, most copies accumulate mutations rendering them incapable of further propagation
- Elements with defunct protein-coding genes may still be able to transpose
- Host genomes have presumably evolved counter-mechanisms to contain the propagation of repetitive elements
- Repeat propagation may lead to a rapid increase in genome size (doubling of the maize genome within less than 3 million years)
- Repeat insertion is a major type of mutation in many organisms (Drosophila > 50%, mouse 2.5%, human 0.07%)
- Insertion events can be dated by cross-species comparison
- Ancient elements occur at homologous genome positions in related species
- Average sequence divergence between repeat family members indicates age
- Young repeats are species-specific
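The idea that average sequence divergence indicates age can be sketched as below. This is only an illustration under stated assumptions: the copies are assumed to be already aligned to a family consensus (equal-length strings, '-' for gaps), the subroutine names are hypothetical, no correction for multiple substitutions is applied, and the substitution rate passed to approximate_age is a caller-supplied placeholder rather than a measured value.

```perl
use strict;
use warnings;

# Fraction of mismatching positions between two aligned sequences of
# equal length; positions with a gap ('-') or N in either one are skipped.
sub divergence {
    my ($x, $y) = @_;
    die "aligned sequences must have equal length\n"
        if length($x) != length($y);
    my ($diff, $compared) = (0, 0);
    for my $i (0 .. length($x) - 1) {
        my $p = uc(substr($x, $i, 1));
        my $q = uc(substr($y, $i, 1));
        next if $p =~ /[-N]/ or $q =~ /[-N]/;
        $compared++;
        $diff++ if $p ne $q;
    }
    return $compared ? $diff / $compared : 0;
}

# Mean divergence of a set of aligned copies from the consensus.
sub mean_divergence {
    my ($consensus, @copies) = @_;
    return 0 unless @copies;
    my $sum = 0;
    $sum += divergence($consensus, $_) for @copies;
    return $sum / @copies;
}

# Rough age: divergence divided by a substitution rate
# (substitutions per site per year, supplied by the caller).
sub approximate_age {
    my ($div, $rate) = @_;
    return $div / $rate;
}

# Usage with toy data; the rate below is an arbitrary placeholder.
my $d = mean_divergence("ACGTACGT", "ACGTACGA", "ACCTAC-T");
printf "mean divergence %.4f, approximate age %.3g years\n",
    $d, approximate_age($d, 2e-9);
```

Families with low mean divergence would be classified as young, those with high divergence as ancient.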
16. Distribution of Interspersed Repeats in the Human Genome
- Repeats are found almost exclusively in non-coding regions
- Repeats frequently occur in 3' UTRs of mRNAs (spotted cDNA arrays contain many Alu elements -> technical problem for hybridization)
- Repeats tend to be clustered in presumably function-less genomic regions
- Young elements are often inserted into old elements -> nested repeat structures
- Alu elements are more frequent in GC-rich regions
- L1 elements are more frequent in AT-rich regions
17. Most frequent interspersed repeats in the human genome
18. Retro-pseudogenes
- Intron-free copies of functional genes, usually not transcribed
- Origin: reverse transcription of mRNA followed by insertion into the genome
- Generation of retro-pseudogenes presumably mediated by LINE-derived enzymes
- Unlike SINEs and LINEs, retro-pseudogenes lack internal promoters -> no epidemic amplification within the genome
- The human genome contains about 30000 retro-pseudogenes
- Most retro-pseudogenes have accumulated mutations in the coding regions, including mutations leading to frame-shifts or premature stop codons
- Most retro-pseudogenes are derived from housekeeping genes strongly expressed in the germ line, including many small RNA genes
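A simple check for the disabling mutations mentioned above can be sketched as follows. It assumes the candidate coding region has already been extracted in the reading frame of the parent gene and uses the three stop codons of the standard genetic code; the subroutine names are hypothetical, and the length test is only a crude indicator of a frame-shift.

```perl
use strict;
use warnings;

# Stop codons of the standard genetic code.
my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# Return the 0-based codon indices of in-frame stop codons that occur
# before the final codon of the given coding sequence.
sub premature_stops {
    my ($cds) = @_;
    $cds = uc($cds);
    my @hits;
    my $n_codons = int(length($cds) / 3);
    for my $i (0 .. $n_codons - 2) {
        push @hits, $i if $is_stop{ substr($cds, 3 * $i, 3) };
    }
    return @hits;
}

# A length that is not a multiple of three hints at a frame-shifting
# insertion or deletion relative to an intact coding sequence.
sub length_breaks_frame {
    my ($cds) = @_;
    return length($cds) % 3 != 0;
}

# Usage on a toy sequence: ATG TAA GGG TGA
my @stops = premature_stops("ATGTAAGGGTGA");
print "premature stop at codon index: @stops\n" if @stops;
```

A more careful analysis would align the pseudogene to its functional parent gene and locate the individual insertions and deletions.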
19. Alu repeats in human DNA visualized by a dot matrix
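The coordinates for a dot matrix like the one on this slide can be computed with a short word-match routine. The sketch below is an assumption about how such a plot might be produced, not the method used for the original figure: it places a dot wherever an exact word of length $w occurs in both sequences, and the toy sequence and the name dot_matrix are made up for this example.

```perl
use strict;
use warnings;

# Return [i, j] pairs (0-based) such that the word of length $w starting
# at position i of $x is identical to the word starting at position j of $y.
sub dot_matrix {
    my ($x, $y, $w) = @_;
    ($x, $y) = (uc($x), uc($y));
    my %index;                       # word -> list of start positions in $y
    for my $j (0 .. length($y) - $w) {
        push @{ $index{ substr($y, $j, $w) } }, $j;
    }
    my @dots;
    for my $i (0 .. length($x) - $w) {
        my $hits = $index{ substr($x, $i, $w) } or next;
        push @dots, [ $i, $_ ] for @$hits;
    }
    return @dots;
}

# Usage: self-comparison of a toy sequence containing the same motif twice.
my $seq = "TTGGCCGGGCGCAAAAGGCCGGGCGCTT";
print join("\t", @$_), "\n" for dot_matrix($seq, $seq, 8);
```

In a self-comparison the main diagonal is always filled; repeated elements show up as additional short diagonals off the main one.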
20. Software for detecting repeats: RepeatMasker
Purpose:
- Identification and classification of repeats
- Masking of repeats for subsequent analysis
Method:
- Local alignment search against a library of consensus sequences of repeat families
- Similarity search program: cross_match (fast Smith-Waterman implementation) or blastn
- Iterative search for detection of nested repeats (removal of complete elements)
- Ancient and young repeats are searched with different scoring systems
- Simultaneously searches for RNA genes (many small RNA genes are repeated in the genome)
- Also identifies simple repeats (tandem mono- and dinucleotide repeats)
Output:
- Feature table-like annotation of repeats
- Original sequence with repeats replaced by runs of Ns
Notes:
- In principle, a library of repeats is required for each species
- Blastn with a human query containing an unmasked Alu element will return over 100000 matches to the human genome!
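The masking step, replacing annotated repeats by runs of Ns, can be illustrated with a few lines of Perl. This sketch does not call RepeatMasker or parse its output files: the repeat annotation is assumed to be available as a list of [start, end] pairs with 1-based, inclusive coordinates, and the name mask_repeats is hypothetical.

```perl
use strict;
use warnings;

# Replace each annotated repeat interval with a run of Ns.
# Assumption: intervals are [start, end] pairs, 1-based and inclusive.
sub mask_repeats {
    my ($seq, @intervals) = @_;
    my $len = length($seq);
    for my $iv (@intervals) {
        my ($start, $end) = @$iv;
        next if $start < 1 or $start > $len or $end < $start;
        $end = $len if $end > $len;             # clip to sequence end
        my $n = $end - $start + 1;
        substr($seq, $start - 1, $n) = 'N' x $n;
    }
    return $seq;
}

# Usage: mask positions 3-6 and 10-12 of a toy sequence.
print mask_repeats("ACGTACGTACGT", [3, 6], [10, 12]), "\n";
```

The masked sequence keeps its original length, so coordinates of later similarity-search hits still refer to the unmasked sequence.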