Computational Analysis of Genome Sequences - PowerPoint PPT Presentation

Provided by: isrecI

1
Computational Analysis of Genome Sequences

Methods for analyzing compositional properties
Introduction to Interspersed Repeats
SINEs
LINEs
Retrovirus-like elements
DNA transposons
Computational Analysis of Repeats

2
Computational Analysis of Genome Sequences

Motivation, teaching objectives:
Presentation of a practical application of Perl
programming
Advertisement of the do-it-yourself approach:
you don't always need professionally engineered
software to make discoveries
Introduction to an important topic in genomics:
repetitive elements
Not really a bioinformatics topic
However, research in this area almost exclusively
relies on computational analysis of genome
sequences
It is important to know what's in the human
genome
It's also important to know how genomes evolve
Repetitive elements constitute a third class of
evolutionary entities, in addition to cells and
viruses
The biology of repetitive elements is fascinating

3
Computational Analysis of Compositional
Properties
Sliding window techniques: The complete sequence is
analyzed in overlapping windows of fixed size,
e.g. size 100 bp, overlap 90 bp. For each window,
compute some index, e.g. the fraction of G+C, or the
ratio of observed over expected number of CpG
dinucleotides. Plot the GC index or CpGobs/CpGexp
against the center position of the window.
Cumulative plots: The complete sequence is analyzed
in a series of subsequences, all starting at the
beginning and ending at regularly spaced endpoints
within the sequence, e.g. 1-100, 1-200, ..., 1-1000000.
For each subsequence, compute some index, e.g. the
difference G-C over the total number G+C in
the subsequence (GC skew). Plot the GC skew index
against the end position (or % of genome) of the
subsequence.
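
The sliding-window procedure described above can be sketched as follows. This is a minimal illustration in Python rather than Perl, not code from the presentation. The function names are made up for the example, and the expected CpG count is assumed to be (number of C) x (number of G) / window length, which is one common convention.

```python
def sliding_windows(seq, size=100, step=10):
    """Yield (center_position, window) for overlapping windows.

    size=100 and step=10 correspond to 100 bp windows with 90 bp overlap.
    Positions are 0-based offsets into seq.
    """
    for start in range(0, len(seq) - size + 1, step):
        yield start + size // 2, seq[start:start + size]


def gc_fraction(window):
    """Fraction of G and C bases in the window."""
    if not window:
        return 0.0
    return (window.count("G") + window.count("C")) / len(window)


def cpg_obs_exp(window):
    """Observed CpG count divided by the count expected from base composition."""
    c = window.count("C")
    g = window.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = window.count("CG")
    expected = c * g / len(window)  # assumed definition of the expected count
    return observed / expected


if __name__ == "__main__":
    # Toy sequence, for illustration only.
    seq = "ATGCGCGTATATCGCGATAT" * 10
    for center, win in sliding_windows(seq, size=100, step=10):
        print(center, round(gc_fraction(win), 3), round(cpg_obs_exp(win), 3))
```

Each printed line gives one point of the plot: window center, GC fraction, and CpGobs/CpGexp.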
4
Example GC skew in bacterial genomes
From Grigoriev 1998. Nucl. Acids Res. 26:2286.
5
Explanation for GC skew in bacterial genomes
Spontaneous deamination of cytosine on the
single-stranded template → base-pairing with A
→ insertion of A instead of G.
6
Perl script for computing GC-skew
#!/usr/bin/perl
use strict;
use warnings;

my $step = 1000;
my $k = 0;    # number of bases scanned
my $j = 0;    # position within current interval
my $c = 0;    # number of C's found so far
my $g = 0;    # number of G's found so far
my $x;        # cumulative G minus C count (not divided by G+C)

while (<STDIN>) {                  # note: lines are processed one by one
    if (not /^>/) {                # skip FASTA header lines
        chomp;                     # remove \n at end of line
        s/[^ACGT]//g;              # remove non-base characters
        my @seq = split //;        # convert $_ into base array
        foreach my $nucleo (@seq) {
            if ($nucleo eq "C") { $c++ }
            if ($nucleo eq "G") { $g++ }
            $k++;                  # increment sequence counter
            $j++;                  # increment window size counter
            if ($j >= $step) {     # after reading $step bases, compute GC skew
                $x = $g - $c;
                printf "%8i %6.4f\n", $k, $x;    # print current results
                $j = 0;            # reset window counter
            }
        }
    }
}
$x = $g - $c;
printf "%8i %6.4f\n", $k, $x;      # print last value
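
For comparison, here is a minimal Python sketch of a cumulative plot that uses the normalized index defined on slide 3, (G - C) / (G + C) for each subsequence starting at position 1. It is an illustration under that assumption, not the presenter's script. The function name and the choice to always report the full sequence as the last point are made up for the example.

```python
def cumulative_gc_skew(seq, step=1000):
    """Return [(end_position, skew)] for prefixes 1..step, 1..2*step, ...

    skew = (G - C) / (G + C) over the prefix; 0.0 if the prefix has no G or C.
    The complete sequence is always included as the last point.
    """
    g = c = 0
    points = []
    for k, base in enumerate(seq.upper(), start=1):
        if base == "G":
            g += 1
        elif base == "C":
            c += 1
        if k % step == 0 or k == len(seq):
            total = g + c
            points.append((k, (g - c) / total if total else 0.0))
    return points


if __name__ == "__main__":
    # Toy sequence: G-rich first half, C-rich second half.
    toy = "GGGA" * 5 + "CCCA" * 5
    for end, skew in cumulative_gc_skew(toy, step=10):
        print(f"{end:8d} {skew:6.4f}")
```

With the toy input the skew is 1 over the first half and falls back to 0 once the C-rich half has been read, which is the kind of turning point the cumulative plot is meant to expose.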
7
Interspersed Repeats
Abundant: 50% of human DNA consists of
interspersed repeats
Length: 100-8000 bp
Similarity: 60-100% identity
Selfish DNA: designed to increase their own copy
number within the genome
Two types:
DNA transposons: cut&paste mechanism.
Retrotransposons: replicate through
RNA intermediates like retroviruses.
Autonomous elements contain genes for enzymes
necessary for transposition (e.g. reverse transcriptase)
8
Major Classes of Human Interspersed Repeats
9
Major characteristics of SINEs (Short
Interspersed Elements)

Length: 80-300 bp
Non-autonomous elements: no protein-coding genes,
presumably rely on reverse transcriptase from
LINE elements
Transcribed from internal (downstream) POL III
promoters
Poly-A tails for priming reverse transcription
Evolutionary origin: tRNA or other small RNA
genes.
Prototype: human Alu elements
Origin: 7SL RNA gene
Contains internal duplication
Name: based on two internal Alu restriction sites
(Alu digest of complete human DNA produces a
characteristic band)
Several subfamilies

10
Major characteristics of LINEs (Long Interspersed
Elements)

Length: 6000-8000 bp
Autonomous elements: encode several proteins,
including reverse transcriptase
Presumably transcribed from internal (downstream)
POL II promoter
Poly-A tails for priming reverse transcription
Evolutionary origin: ancient, perhaps
monophyletic
Other characteristics: many 5'-truncated copies,
presumably resulting from premature termination
of reverse transcription.

11
Major characteristics of retrovirus-like elements
(RLE)

Length: 1500-10000 bp
Flanked by long terminal repeats (LTRs)
containing promoter signals and polyadenylation
sites.
Autonomous elements: encode several proteins,
including reverse transcriptase
Mammalian MaLR is a non-autonomous RLE lacking a
reverse transcriptase gene
Transcribed from POL II promoters included in
the LTR
Elaborate mechanism for restoring LTRs before
insertion
Evolutionary origin: ancient, same as
retroviruses
Other characteristics: frequent insertion of
single LTRs

12
Major characteristics of DNA transposons

Length: 80-3000 bp
Flanked by short inverted repeats.
Autonomous elements: encode a protein named
transposase or integrase
Original copy is cut out, leaving a double-strand
break
Increase of copy number through
recombination-mediated double-strand break repair
Evolutionary origin: ancient, common to
prokaryotes and eukaryotes.
Other characteristics: frequent internal
deletions
Spliceosomal introns and inteins (protein introns)
may have originated from self-splicing DNA
transposons

13
Cut&Paste Transposons: Increasing Copy Number
through Repair by Sister Chromatids
This figure explains the events happening after the
double-strand break. The bottom strand represents
the sister chromatid used for repair, not the
newly inserted transposon (P-element in this case).
14
Major Classes of Human Interspersed Repeats
15
Evolution of Interspersed Repeats

Repeat families rapidly expand during short
periods (a few million years)
Repeats are classified as young or ancient according
to the time of their major burst
After insertion, most copies accumulate mutations
rendering them incapable of further propagation
Elements with defunct protein-coding genes may
still be able to transpose
Host genomes presumably have evolved
counter-mechanisms to contain propagation of
repetitive elements
Repeat propagation may lead to rapid increase in
genome size (doubling of maize genome within less
than 3 million years)
Repeat insertion: major type of mutation in many
organisms (Drosophila > 50%, mouse 2.5%, human
0.07%)
Insertion events can be dated by cross-species
comparison
Ancient elements occur at homologous genome
positions in related species.
Average sequence divergence between repeat family
members indicates age.
Young repeats are species-specific.
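
The idea that average sequence divergence indicates age can be illustrated with a short Python sketch (Python rather than the Perl used on slide 6). It is not from the presentation. It assumes that each copy has already been aligned to the family consensus at equal length with "-" for gaps, and it applies the Jukes-Cantor correction, which is one standard choice among several. Any conversion to years would need a substitution rate; the `rate_per_site_per_myr` parameter is hypothetical and must be supplied by the user.

```python
import math


def p_distance(copy, consensus):
    """Fraction of mismatches between two aligned sequences (gap columns skipped)."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(copy.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def jukes_cantor(p):
    """Jukes-Cantor corrected distance, d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def mean_divergence(copies, consensus):
    """Average corrected divergence of family members from the consensus."""
    dists = [jukes_cantor(p_distance(c, consensus)) for c in copies]
    return sum(dists) / len(dists)


def approximate_age_myr(divergence, rate_per_site_per_myr):
    """Divergence divided by a user-supplied rate (hypothetical parameter)."""
    return divergence / rate_per_site_per_myr


if __name__ == "__main__":
    consensus = "ACGTACGTACGTACGTACGT"
    copies = ["ACGTACGTACGAACGTACGT", "ACGTTCGTACGTAC-TACGA"]  # toy data
    print(round(mean_divergence(copies, consensus), 4))
```

A family whose copies show low mean divergence would be read as young, a family with high divergence as ancient.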

16
Distribution of Interspersed Repeats in the Human
Genome
Repeats almost exclusively found in non-coding
regions
Repeats frequently occur in 3' UTRs of
mRNAs (spotted cDNA arrays contain many Alu
elements → technical problem for
hybridization)
Repeats tend to be clustered in
presumably function-less genomic regions
Young elements often inserted into old elements →
nested repeat structures
Alu elements more frequent in GC-rich regions
L1 elements more frequent in AT-rich regions
17
Most frequent interspersed Repeats in the human
Genome
18
Retro-pseudogenes
Intron-free copies of functional genes, usually
not transcribed.
Origin: reverse transcription of
mRNA followed by insertion into the
genome.
Generation of retro-pseudogenes presumably
mediated by LINE-derived enzymes
Unlike SINEs and LINEs, retro-pseudogenes lack
internal promoters → no epidemic amplification
within genome
Human genome contains about 30000 retro-pseudogenes
Most retro-pseudogenes have accumulated mutations in
the coding regions, including mutations leading
to frame-shifts or premature stop codons
Most retro-pseudogenes derived from housekeeping genes
strongly expressed in the germ-line, including
many small RNA genes.
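
A do-it-yourself check for the disabling mutations mentioned above could look like the following Python sketch. It is an illustration only, not from the presentation. It assumes that the candidate coding region has already been extracted in the reading frame of the parent gene and that the standard stop codons apply. A length that is not a multiple of three is treated as a hint of a frame-shift, not as proof.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stops before the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [i for i in range(n_codons - 1)
            if cds[3 * i:3 * i + 3] in STOP_CODONS]


def length_breaks_frame(cds):
    """True if the length is not a multiple of 3 (possible frame-shift)."""
    return len(cds) % 3 != 0


if __name__ == "__main__":
    intact = "ATGGGGCCCTGA"
    damaged = "ATGTAAGGGTGA"  # toy example with a stop at codon index 1
    print(premature_stops(intact), length_breaks_frame(intact))
    print(premature_stops(damaged), length_breaks_frame(damaged))
```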
19
Alu repeats in human DNA visualized by a dot
matrix
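
A dot matrix of this kind can be computed with very little code. The Python sketch below is an illustration, not the program used for the slide. It places a dot wherever two sequences share an identical word of fixed length; the word length of 4 and the function names are arbitrary choices for the example.

```python
def dot_matrix(seq_x, seq_y, word=4):
    """Return the set of (i, j) where seq_x[i:i+word] == seq_y[j:j+word]."""
    index = {}
    for j in range(len(seq_y) - word + 1):
        index.setdefault(seq_y[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], []):
            dots.add((i, j))
    return dots


def render(dots, n_x, n_y):
    """Draw the matrix as text: one row per position in seq_y."""
    rows = []
    for j in range(n_y):
        rows.append("".join("*" if (i, j) in dots else "." for i in range(n_x)))
    return "\n".join(rows)


if __name__ == "__main__":
    # Toy sequence with the same element inserted twice.
    element = "GGCCGGGCGC"
    seq = "ATATAT" + element + "TTTTAAAA" + element + "ATAT"
    dots = dot_matrix(seq, seq, word=6)
    print(render(dots, len(seq), len(seq)))
```

When a sequence is compared with itself, repeated elements show up as short diagonal lines off the main diagonal.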
20
Software for detecting repeats: RepeatMasker
Purpose:
Identification and classification of repeats
Masking of repeats for subsequent analysis
Method:
Local alignment search against a library of
consensus sequences of repeat families.
Similarity search program: cross_match
(fast Smith-Waterman implementation) or blastn
Iterative search for detection of nested repeats
(removal of complete elements)
Ancient and young repeats searched with different
scoring systems
Simultaneously searches for RNA genes
(many small RNA genes are repeated in the genome)
Also identifies simple repeats (tandem mono- and
dinucleotide repeats).
Output:
Feature table-like annotation of repeats
Original sequence with repeats replaced by runs of Ns
Notes:
In principle, a library of repeats is required for
each species
Blastn with a human query containing an unmasked
Alu element will return over 100000
matches to the human genome!
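
The masking step itself, replacing annotated repeats by runs of Ns, is simple to reproduce. The Python sketch below is an illustration and does not use or imitate RepeatMasker's own code or file formats. It assumes the repeat annotation has already been reduced to a list of 1-based, inclusive (start, end) pairs, which is a hypothetical input format chosen for the example.

```python
def mask_repeats(seq, intervals):
    """Replace every base inside a 1-based inclusive (start, end) interval by N."""
    masked = list(seq)
    for start, end in intervals:
        # Clip to the sequence so out-of-range annotations do not raise errors.
        for pos in range(max(start, 1) - 1, min(end, len(seq))):
            masked[pos] = "N"
    return "".join(masked)


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGTACGT"
    repeats = [(3, 6), (15, 18)]  # toy annotation
    print(mask_repeats(seq, repeats))
```

The masked sequence can then be passed to a similarity search without the repeat-derived matches described in the notes above.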
