Title: Automated sequencing machines
1. Automated sequencing machines
- Automated sequencing machines, particularly those made by PE Applied Biosystems, use 4 colors, so they can read all 4 bases at once.
2. All the Genes?
- Any human gene can now be found in the genome by similarity searching with over 95% certainty.
- However, the sequence still has many gaps:
  - it is unlikely that an uninterrupted genomic segment will be found for any given gene
  - pseudogenes still cannot be identified with certainty
- This will improve as more sequence data accumulates.
3. Finding Genes in Genome Sequence Is Not Easy
- About 2% of human DNA encodes functional genes.
- Genes are interspersed among long stretches of non-coding DNA.
- Repeats, pseudogenes, and introns confound matters.
4. Impact on Bioinformatics
- Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.
- It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.
5. Completed genome projects
Eukaryotes: 9 completed
- Anopheles gambiae
- Arabidopsis thaliana
- Caenorhabditis elegans
- Drosophila melanogaster
- Encephalitozoon cuniculi
- Guillardia theta nucleomorph
- Plasmodium falciparum
- Saccharomyces cerevisiae (yeast)
- Schizosaccharomyces pombe
Eukaryotes in progress (partial)
- Danio rerio (zebrafish)
- Glycine max (soybean)
- Hordeum vulgare (barley)
- Leishmania major
- Rattus norvegicus
Bacteria: 132
Archaea: 16
Viruses: 1413
6. Six basic questions about genomes
1 How is a genome sequenced?
2 When is the project finished?
3 Sequence one individual or many?
4 What information is in the DNA?
5 How many genes are in the genome?
6 How can whole genomes be compared?
7. Question 1: Genome projects, sequencing strategies
Hierarchical shotgun method: assemble contigs from various chromosomes, then sequence and assemble them.
Contig: a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished. A contig map is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they make it possible to study a complete, and often large, segment of the genome by examining a series of overlapping clones, which together provide an unbroken succession of information about that region.
Scaffold: an ordered set of contigs placed on a chromosome.
Shotgun: an approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
http://www.genome.gov/glossary.cfm
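The shotgun idea of ordering fragments by their overlaps and merging them can be illustrated with a toy example. The sketch below is a naive greedy merge, not how a production assembler such as ARACHNE works; the function names, the `min_len` threshold, and the three reads are all made up for illustration, and it assumes error-free reads from one strand with no repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases and
    returns the remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the made-up sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Repeats are exactly where this naive approach breaks down: two reads from different copies of a repeat overlap just as well as true neighbours, which is one reason real projects rely on linked read pairs and mapped clones.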
8. 3. Whole Genome Shotgun Sequencing
[Figure labels: genome; forward-reverse linked reads]
9. ARACHNE: Whole Genome Shotgun Assembly
http://www-genome.wi.mit.edu/wga/
10. Question 2: When is the project finished?
Get five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
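To give a feel for what five- to ten-fold coverage means, here is a small sketch of the usual back-of-the-envelope arithmetic. Coverage is taken as (number of reads x read length) / genome size, and the uncovered fraction uses the standard Poisson approximation e^(-c), which assumes reads land uniformly at random. The 500 bp read length and 3 Gb genome size are illustrative assumptions, not figures from the slides.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_fraction_uncovered(c: float) -> float:
    """Poisson estimate of the fraction of bases hit by no read at coverage c."""
    return math.exp(-c)


# Assumed values for illustration only.
GENOME_SIZE = 3_000_000_000  # roughly human-sized genome
READ_LENGTH = 500            # bases per read

for c in (5, 10):
    n = reads_needed(c, READ_LENGTH, GENOME_SIZE)
    print(f"{c}x coverage: {n:,} reads, "
          f"about {expected_fraction_uncovered(c):.5%} of bases uncovered")
```

Under these assumptions, going from five-fold to ten-fold coverage doubles the number of reads but shrinks the expected uncovered fraction by a factor of roughly 150.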
12. Repetitive DNA sequences: five classes
1 Interspersed repeats (transposon-derived repeats)
-- 45% of human genome
-- LTR, SINE, LINE
2 Processed pseudogenes
3 Simple sequence repeats
-- micro- and minisatellites
-- ACAAACT, 11 million times in a Drosophila genome
-- Human genome has 50,000 CA dinucleotide repeats
4 Segmental duplications (about 5% of human genome)
5 Tandem repeats (e.g. telomeres, centromeres)
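Simple sequence repeats such as the CA dinucleotide runs mentioned above are easy to locate with a pattern search. The following is a minimal sketch using a regular expression; the function name, the `min_copies` threshold, and the test sequence are hypothetical, and dedicated repeat-finding tools handle imperfect repeats that this exact-match approach misses.

```python
import re


def find_simple_repeats(seq: str, unit: str, min_copies: int = 3) -> list[tuple[int, int]]:
    """Find perfect tandem runs of `unit` with at least `min_copies` copies.

    Returns a list of (start_position, copy_number) tuples, 0-based.
    """
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [
        (m.start(), len(m.group()) // len(unit))
        for m in pattern.finditer(seq.upper())
    ]


# Made-up sequence with one (CA)4 run and one (CA)2 run.
seq = "GGCACACACATTTCACAGG"
print(find_simple_repeats(seq, "CA"))  # [(2, 4)]
```

The shorter (CA)2 run is ignored because it falls below the copy-number threshold.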
13. LINE and SINE repeats
A LINE (long interspersed
nuclear element) encodes a reverse transcriptase
(RT) and perhaps other proteins. Mammalian
genomes contain an old LINE family, called LINE2,
which apparently stopped transposing before the
mammalian radiation, and a younger family, called
L1 or LINE1, many of which were inserted after
the mammalian radiation (and are still being
inserted). A SINE (short interspersed nuclear
element) generally moves using RT from a LINE.
Examples include the MIR elements, which
co-evolved with the LINE2 elements. Since the
mammalian radiation, each lineage has evolved its
own SINE family. Primates have Alu elements and
mice have B1, B2, etc. The process of insertion
of a LINE or SINE into the genome causes a short
sequence (7-21 bp for Alus) to be repeated, with
one copy (in the same orientation) at each end of
the inserted sequence. Alus have accumulated
preferentially in GC-rich regions, L1s in GC-poor
regions.
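The GC-rich versus GC-poor distinction can be made concrete by computing GC content in windows along a sequence. This is a minimal sketch; the window and step sizes and the example sequence are arbitrary choices for illustration, and ambiguous bases such as N are simply counted as non-GC.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC content of each full-length window, as (start, gc_fraction) pairs."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


# Made-up sequence: a GC-rich half followed by an AT-rich half.
print(gc_windows("GGCCAATT", window=4, step=4))  # [(0, 1.0), (4, 0.0)]
```

Overlaying repeat annotations on such a profile is one way to look at the Alu and L1 distribution pattern described above.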
14. What is the function of nongenic DNA?
Hypotheses:
- Nongenic DNA performs essential functions, such as regulation of gene expression.
- Nongenic DNA is inert, genetically and physiologically. Excess DNA is incidental and is called junk DNA.
- Nongenic DNA is a functional parasite, or selfish DNA (retrotransposons).
- Nongenic DNA has a structural function.
15. Question 5: How many genes are in a genome?
This depends on how a gene is defined (e.g. protein-coding versus noncoding). It also depends on what methods are used to find genes, and what criteria are applied to determine whether they are real (functional).
16. Classification of DNA
- FUNCTIONAL (sequences that perform a function)
  - Coding (translated into protein)
  - Non-coding (not translated)
    - Transcribed (functions at the RNA level: ribosomal subunits)
    - Not transcribed (functions at the DNA level: intron, promoter, enhancer, etc.)
- NON-FUNCTIONAL (sequences that perform no function: junk DNA)
17. Gene-finding algorithms
Homology-based searches (extrinsic): rely on previously identified genes.
Algorithm-based searches (intrinsic): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA.
18. [Figure labels: DNA, RNA, intron, mature RNA, protein]
19. Homology-based searching: compare DNA to expressed genes (ESTs)
[Figure labels: DNA, RNA, intron, RNA, protein]
20. [Figure labels: DNA, RNA]
Algorithm-based searching: compare DNA in exons (unique codon usage) with introns (unique splice sites) and with noncoding DNA. Identify open reading frames (ORFs).
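Identifying open reading frames is the simplest of the intrinsic methods, and can be sketched in a few lines. The version below is deliberately minimal: it scans only the forward strand in its three frames, treats an ORF as the stretch from an ATG to the first in-frame stop codon, and ignores introns, so it is closer to what works on bacterial sequence than on human genes. The function name, the `min_codons` cutoff, and the test sequence are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Find ORFs on the forward strand in all three reading frames.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive).
    Returns (start, end, sequence) tuples with 0-based, end-exclusive
    coordinates; ORFs shorter than `min_codons` codons are skipped.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Made-up sequence containing one ORF in the third frame.
print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 'ATGAAATTTGGGTAA')]
```

Real gene finders add the other signals listed on the slide, such as codon usage and splice-site models, because ORF length alone says little in intron-rich genomes.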
23. Question 5: How many genes are in the human genome?
One answer is about 30,000. BUT how many genes?
-- A lot more than a fungus (6,000)
-- Somewhat more than a fly (13,000) or a worm (19,000)
-- About the same as a plant (Arabidopsis, 25,000)
-- Two groups estimate 30,000 to 35,000, but there is only partial overlap in their gene lists!
-- One Drosophila gene potentially yields 38,000 distinct proteins by alternative splicing.
-- A microarray-based survey of chromosomes 21 and 22 finds 10 times more transcripts than are annotated.
24. Question 6: How can whole genomes be compared?
-- Molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another
-- We looked at TaxPlot and COG for bacterial (and for some eukaryotic) genomes
-- PipMaker and other programs align large stretches of genomic DNA from multiple species
25. Resources to study the human genome
NCBI: www.ncbi.nlm.nih.gov
The Sanger Institute/European Bioinformatics Institute: www.ensembl.org
UCSC Genome Bioinformatics Site: http://genome.ucsc.edu/
26. Top ten challenges for bioinformatics
1 Precise models of where and when transcription will occur in a genome (initiation and termination)
2 Precise models of RNA splicing
3 Precise models of signal transduction pathways: ability to predict cellular responses to external stimuli
4 Determining protein-DNA, protein-RNA, and protein-protein recognition codes
5 Accurate ab initio protein structure prediction
27. Top ten challenges for bioinformatics
6 Rational design of small molecule inhibitors of proteins
7 Mechanistic understanding of protein evolution
8 Mechanistic understanding of speciation
9 Development of effective gene ontologies: systematic ways to describe gene and protein function
10 Education: development of bioinformatics curricula
28. Comparative Genomics Using ACT, the Artemis Comparison Tool
29. Artemis Comparison Tool (ACT)
- Based on Artemis and coded in Java.
- Allows visualisation of two or more sequences and a comparison file.
- The comparison file can be BLASTn or tBLASTx output.
- Retains all the functionality of Artemis.
30. The ACT Display
[Figure labels: genome1, genome2, genome3, BLAST HSPs, zoom scroll bar, filter scroll bar]
31. Running ACT
[Flow diagram labels, in order: Sequence 1, Sequence 2, BLASTn or tBLASTx, MSPcrunch, reformat]
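A comparison file is essentially a list of BLAST high-scoring pairs (HSPs) with their coordinates in each sequence. The sketch below reads HSPs from BLAST's standard 12-column tab-separated output and filters them by identity and length, similar in spirit to what a filter control does in a viewer. It is an assumption that the comparison is available in this tabular layout; the MSPcrunch step named on the slide produces its own format, which is not handled here. The `Hsp` type, the function names, and the two sample rows are made up for illustration.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """One high-scoring pair from BLAST tabular output (subset of columns)."""
    query: str
    subject: str
    pct_identity: float
    length: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    bit_score: float


def parse_blast_tabular(text: str) -> list[Hsp]:
    """Parse 12-column tab-separated BLAST output, skipping comment lines.

    Expected columns: query, subject, % identity, alignment length,
    mismatches, gap opens, q.start, q.end, s.start, s.end, e-value, bit score.
    """
    hsps = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hsps.append(Hsp(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[11])))
    return hsps


def filter_hsps(hsps: list[Hsp], min_identity: float = 0.0,
                min_length: int = 0) -> list[Hsp]:
    """Keep HSPs at or above the identity and length thresholds."""
    return [h for h in hsps
            if h.pct_identity >= min_identity and h.length >= min_length]


def is_inverted(h: Hsp) -> bool:
    """True if query and subject run in opposite directions."""
    return (h.q_end - h.q_start) * (h.s_end - h.s_start) < 0


# Invented example rows, not real BLAST results.
SAMPLE = (
    "genome1\tgenome2\t98.50\t1200\t18\t0\t100\t1299\t5000\t6199\t0.0\t2100\n"
    "genome1\tgenome2\t82.00\t150\t27\t0\t4000\t4149\t9150\t9001\t1e-30\t120\n"
)

for h in filter_hsps(parse_blast_tabular(SAMPLE), min_identity=80.0):
    kind = "inverted" if is_inverted(h) else "forward"
    print(h.q_start, h.q_end, h.s_start, h.s_end, kind)
```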
32. ACT
- Designed for looking at complete bacterial genomes.
33. [Figure labels: Knowlesi contigs, tBLASTx, Falciparum Chr 3, tBLASTx, Yoelii contigs (TIGR)]
35. Orthologue / Paralogue
- Orthologue: homologous genes in different organisms that descend from a single ancestral gene through speciation and typically have the same function.
- Paralogue: homologous genes in the same organism that originated from gene duplication.
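In practice, candidate orthologues between two genomes are often picked with the reciprocal best hit heuristic: gene a in genome A and gene b in genome B are paired if each is the other's top-scoring match. This heuristic is not described in the slides and is only a rough proxy for orthology; the sketch below assumes similarity scores (for example BLAST bit scores) have already been computed, and the gene names and scores are invented.

```python
def best_hits(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: dict[str, tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: dict[tuple[str, str], float],
                         b_vs_a: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: a1 and b1 match each other best; a2 is a weaker
# relative (a possible paralogue of a1) whose best hit b2 prefers a1.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 150.0, ("a2", "b2"): 90.0}
b_vs_a = {("b1", "a1"): 198.0, ("b2", "a1"): 149.0, ("b2", "a2"): 80.0}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```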
36-39. Orthologue / Paralogue
[Figure series labels: Species 1, Species 2, Gene A, Gene B]
40. AG-FMVZ-USP
43. T. brucei vs L. major (cont.)
44. T. brucei vs T. cruzi
45. L. major has a break in synteny that is conserved in T. brucei and T. cruzi
[Figure labels: T. cruzi chr 3, T. brucei chr 1, L. major chr 12, T. brucei chr 6]
46. Software
- www.sanger.ac.uk/Software/Artemis
- www.sanger.ac.uk/Software/ACT
- www.genome.nghri.nih.gov/blastall
- www.cgr.ki.se/cgr/goups/sonnhammer/MSPcrunch.html