Title: Genes, Genomes, and Genomics
1Genes, Genomes, and Genomics
Bioinformatics in the Classroom June, 2003
2Two again
Francis Collins, HGP
Craig Venter, Celera Inc.
3What's in a chromosome?
4Hierarchical vs. Whole Genome
5How many genes?
- Consortium: 35,000 genes?
- Celera: 30,000 genes?
- Affymetrix: 60,000 human genes on GeneChips?
- Incyte and HGS: over 120,000 genes?
- GenBank: 49,000 unique gene coding sequences?
- UniGene: > 89,000 clusters of unique ESTs?
6Current consensus (in flux)
- 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
- 17,000 predicted (GenScan, GeneFinder, GRAIL)
- Based on and limited to previous knowledge
7How do we get from here
8to here,
9What are genes? - 1
- Complete DNA segments responsible for making functional products
- Products
- Proteins
- Functional RNA molecules
- RNAi (interfering RNA)
- rRNA (ribosomal RNA)
- snRNA (small nuclear)
- snoRNA (small nucleolar)
- tRNA (transfer RNA)
10What are genes? - 2
- Definition vs. dynamic concept
- Consider
- Prokaryotic vs. eukaryotic gene models
- Introns/exons
- Posttranscriptional modifications
- Alternative splicing
- Differential expression
- Genes-in-genes
- Genes-ad-genes
- Posttranslational modifications
- Multi-subunit proteins
11Prokaryotic gene model
- Small genomes, high gene density
- Haemophilus influenzae genome: 85% genic
- Operons
- One transcript, many genes
- No introns.
- One gene, one protein
- Open reading frames
- One ORF per gene
- ORFs begin with a start codon and end with a stop codon (by definition)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
12Eukaryotic gene model
- Large genomes, low gene density
- Homo sapiens genome: 25% genic
- Posttranscriptional modification
- 5'-cap, polyA tail, splicing
- Open reading frames
- One ORF per exon; not all contain starts/stops
- Mature mRNA contains ORF by definition
- Multiple translates
- One gene, many proteins via alternative splicing
13Expansions and Clarifications
- ORFs
- Start, triplets, stop
- Prokaryotes: gene = ORF
- Eukaryotes: gene ≠ ORF; rather, the final mRNA contains the ORF
- Exons
- Remain after introns have been removed
- Flanking exons contain non-coding sequence (5'- and 3'-UTRs)
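The ORF definition above (a start codon, a run of triplets, a stop codon) can be sketched in a few lines of Python. This is a minimal illustration, not a production ORF finder: the function name `find_orfs` and the `min_codons` parameter are assumptions made for this example, only ATG is treated as a start codon, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only the canonical start codon is considered


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is the index just past
    the stop codon, so seq[start:end] is the ORF including its stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

For a prokaryotic sequence, each reported interval is a gene candidate. For a eukaryotic gene, the same scan is only meaningful on the spliced mRNA, since individual exons need not contain a start or stop codon.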
14Where do genes live?
- In genomes
- Example human genome
- Ca. 3,200,000,000 base pairs
- 25 chromosomes: 1-22, X, Y, mt
- 28,000-45,000 genes (current estimate)
- Gene size: 128 nucleotides (RNA gene) to 2,800 kb (DMD)
- Ca. 25% of genome are genes (introns, exons)
- Ca. 1% of genome codes for amino acids (CDS)
- 30 kb gene length (average)
- 1.4 kb ORF length (average)
- 3 transcripts per gene (average)
15Sample genomes
Species          Size       Genes    Genes/Mb
H.sapiens        3,200 Mb   35,000   11
D.melanogaster   137 Mb     13,338   97
C.elegans        85.5 Mb    18,266   214
A.thaliana       115 Mb     25,800   224
S.cerevisiae     15 Mb      6,144    410
E.coli           4.6 Mb     4,300    934
List of 68 eukaryotes, 141 bacteria, and 17 archaea at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
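The Genes/Mb column of the sample genomes is simply the gene count divided by the genome size in megabases. The short sketch below recomputes it from the sizes and gene counts listed on the slide; the names `GENOMES` and `genes_per_mb` are chosen for this example only.

```python
# (species, genome size in Mb, gene count), taken from the slide
GENOMES = [
    ("H.sapiens", 3200.0, 35000),
    ("D.melanogaster", 137.0, 13338),
    ("C.elegans", 85.5, 18266),
    ("A.thaliana", 115.0, 25800),
    ("S.cerevisiae", 15.0, 6144),
    ("E.coli", 4.6, 4300),
]


def genes_per_mb(size_mb, genes):
    """Gene density: number of genes per megabase of genome."""
    return genes / size_mb


for species, size_mb, genes in GENOMES:
    print(f"{species:<16}{genes_per_mb(size_mb, genes):8.1f}")
```

The densities span almost two orders of magnitude, which is the quantitative form of the "small genomes, high gene density" versus "large genomes, low gene density" contrast between the prokaryotic and eukaryotic gene models.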
16The value of genome sequences lies in their annotation
- Annotation: characterizing genomic features using computational and experimental methods
- Genes: four levels of annotation
- Gene prediction: Where are genes?
- What do they look like?
- Domains: What do the proteins do?
- Role: What pathway(s) are they involved in?
17So much DNA so few genes
18Genomic sequence features I
- Repeats
- Transposable elements, simple repeats
- RepeatMasker
- Uses the Smith-Waterman algorithm to align sequences to known repeats
- Advantage: avoids spurious matches to repetitive elements
- Disadvantage: masks sequences that gene prediction programs may need for statistics
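To make the alignment step concrete, here is a minimal sketch of Smith-Waterman local alignment scoring in Python. It is not RepeatMasker's implementation: the function name `smith_waterman_score`, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are assumptions picked for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, since only the score (not the alignment) is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A repeat library search would, under these assumptions, score a genomic window against each known repeat consensus and mask the window when the score passes a threshold.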
19Genomic sequence features II
- Non-coding RNAs (ncRNA)
- tRNA: tRNAscan-SE
- Identifies candidates by scanning for pol III promoters, then applies a further algorithm to decide whether each candidate is a tRNA.
- rRNA, snRNA, miRNA, etc.: COVE
- Identified by performing similarity searches and RNA structure analysis using covariance models.
20Genomic sequence features III
- Genes
- Vary in density, length, structure, number of introns, number of splice forms, etc.
- Identification method depends on the evidence, expertise, and methods available.
- Gene identification usually requires concerted application of bioinformatics methods and wet-lab experimentation.
- Pseudogenes
- Look-alikes of genes that are not transcribed. They obstruct gene-finding efforts.
21Gene identification
- Homology-based gene prediction
- Similarity Searches (e.g. BLAST, BLAT)
- Genome Browsers
- RNA evidence (ESTs)
- Ab initio gene prediction
- Gene prediction programs
- Prokaryotes
- ORF identification
- Eukaryotes
- Promoter prediction
- PolyA-signal prediction
- Splice site, start/stop-codon predictions
22Gene prediction through comparative genomics
- Purifying selection: conserved regions between two genomes are useful, or else they would have diverged.
- If genomes are too close in the phylogenetic tree, there may be too much noise.
- If genomes are too far apart, analogous regions may be missed.
23Genome Browsers
NCBI Map Viewer: www.ncbi.nlm.nih.gov/mapview/
Generic Genome Browser (CSHL): www.wormbase.org/db/seq/gbrowse
Ensembl Genome Browser: www.ensembl.org/
UCSC Genome Browser: genome.ucsc.edu/cgi-bin/hgGateway?org=human
Apollo Genome Browser: www.bdgp.org/annot/apollo/
24Gene discovery using ESTs
- Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
- If a region matches an EST with high stringency, then the region is probably a gene or pseudogene.
- An EST overlapping an exon boundary gives an accurate prediction of that boundary.
25Tools for EST analysis
- BLAST (Basic Local Alignment Search Tool)
- Smaller exons will be missed because they score lower.
- SIM4 (http://pbil.univ-lyon1.fr/sim4.html)
- Useful tool to map a gene to genomic sequence
- Allows for large gaps (introns)
26Ab initio gene prediction
- Prokaryotes
- ORF-Detectors
- Eukaryotes
- Position, extent, and direction through promoter and polyA-signal predictors
- Structure through splice site predictors
- Exact location of coding sequences through determination of relationships between potential start codons, splice sites, ORFs, and stop codons
27Tools
- ORF detectors
- NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
- Promoter predictors
- CSHL: http://rulai.cshl.org/software/index1.htm
- BDGP: fruitfly.org/seq_tools/promoter.html
- ICG: TATA-Box predictor
- PolyA signal predictors
- CSHL: argon.cshl.org/tabaska/polyadq_form.html
- Splice site predictors
- BDGP: http://www.fruitfly.org/seq_tools/splice.html
- Start-/stop-codon identifiers
- DNALC: Translator/ORF-Finder
- BCM: Searchlauncher
28How it works I: Motif identification
- Exon-intron borders: splice sites (introns begin with GT and end with AG, shown in upper case)
Exon        Intron                   Exon
gaggcatcag  GTttgtagactgtgtttcAG     tgcacccact
ccgccgctga  GTgagccgtgtc tattctAG    gacgcgcggg
tgtgaattag  GTaagaggtt atatctccAG    atggagatca
ccatgaggag  GTgagtg ccattatttccAG    gtatgagacg
Motif Extraction Programs at http://www-btls.jst.go.jp/
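A first, naive step of motif identification for splice sites is to list every GT (possible intron start) and AG (possible intron end) dinucleotide in a sequence. The sketch below does only that; the function name `splice_site_candidates` is an assumption for this example, and real splice site predictors score the surrounding sequence context rather than the dinucleotide alone.

```python
def splice_site_candidates(seq):
    """Return (donors, acceptors) as lists of 0-based positions.

    A donor candidate is the index of the G in a GT dinucleotide (intron
    start); an acceptor candidate is the index of the A in an AG
    dinucleotide (intron end).
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# First example sequence from the slide: exon, intron, exon.
example = "gaggcatcagGTttgtagactgtgtttcAGtgcacccact"
donors, acceptors = splice_site_candidates(example)
print(donors, acceptors)
```

On this 40-nucleotide example the scan reports five GT and four AG positions, although only the GT at position 10 and the AG at position 28 are the true intron borders. This is why dinucleotide matches must be combined with further evidence.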
29How it works II - Movies
Pribnow-Box Finder 0/1 Pribnow-Box Finder all
30How it works III The (ugly) truth
31Gene prediction programs
- Rule-based programs
- Use explicit set of rules to make decisions.
- Example: GeneFinder
- Neural Network-based programs
- Use data set to build rules.
- Examples: Grail, GrailEXP
- Hidden Markov Model-based programs
- Use probabilities of states and transitions between these states to predict features.
- Examples: Genscan, GenomeScan
32Rule Based - GeneFinder
- Compares expected vs. observed frequencies to score features such as codon bias via log-likelihood ratios (LLR).
- Each position of a sequence is scored with respect to its potential of being a splice site or translational start site.
- LLR scores and ORF identification determine maximum-length coding segments.
- Total score for a gene is the sum of exon scores minus the gap penalty.
- Performs rather poorly on first and last exons.
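The log-likelihood ratio idea can be sketched as follows: each in-frame codon contributes log2 of its frequency in coding sequence divided by its frequency under a background model, and the contributions are summed. This is not GeneFinder's code. The function name `codon_llr`, the uniform background, and the two codon frequencies in `EXAMPLE_CODING_FREQ` are invented for illustration.

```python
import math

UNIFORM = 1.0 / 64.0  # assumed background: all 64 codons equally likely

# Hypothetical coding-region codon frequencies, invented for illustration.
EXAMPLE_CODING_FREQ = {"GCC": 1.0 / 16.0, "TTA": 1.0 / 256.0}


def codon_llr(seq, coding_freq, background=UNIFORM):
    """Sum log2(P_coding(codon) / P_background) over in-frame codons of seq.

    Codons missing from coding_freq are assumed to occur at the uniform
    rate, so they contribute 0 when the default background is used.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        p_coding = coding_freq.get(seq[i:i + 3], UNIFORM)
        score += math.log2(p_coding / background)
    return score
```

A positive total suggests that the segment uses codons the way known coding sequence does; a negative total suggests the opposite. Very short segments yield few terms, so their scores carry little evidence.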
33Neural Networks - Grail, GrailEXP
- Utilizes sensors trained on a set of known genes of the organism.
- Sensors examine
- Frame Bias Matrix - Uses codon bias to determine ORFs.
- Fickett Pentamer position weight matrices.
- Dinucleotide Fractal Dimensions - Transition of sequential dinucleotides is represented as fractal dimension. Coding sequence (CDS) differs from non-coding sequence (nCDS).
- GrailEXP incorporates a similarity-based method by adding a blastn component to its prediction algorithm. Runs reliably on unmasked sequences.
34HMM - Genscan, GenomeScan
- Genscan uses known transcriptional and translational signals and then uses an HMM to model coding and non-coding regions.
- GenomeScan incorporates a similarity-based method by adding a BLASTX component to its prediction algorithm, using the translated sequence to search protein databases.
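How an HMM labels coding and non-coding regions can be illustrated with a toy two-state model decoded by the Viterbi algorithm. This sketch is far simpler than Genscan's actual model, and every number in it is an assumption: the states `N` (non-coding) and `C` (coding), the transition probabilities, and the emission probabilities (non-coding AT-rich, coding GC-rich) were invented for this example.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # assumed AT-rich
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # assumed GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of N/C."""
    seq = seq.upper()
    if not seq:
        return ""
    # Work in log space to avoid underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these invented parameters, a run of A's followed by a run of G's is labelled non-coding and then coding, because one state switch costs less than emitting ten unlikely bases from the wrong state.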
35Evaluating prediction programs
- Sensitivity vs. Specificity
- Sensitivity
- How many genes were found out of all present?
- Sn = TP / (TP + FN)
- Specificity
- How many predicted genes are indeed genes?
- Sp = TP / (TP + FP)
- Programs that combine statistical evaluations with similarity searches are most powerful.
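The two measures can be computed directly from sets of annotated and predicted coding positions, as in the sketch below. The function names and the example coordinates are assumptions for illustration. Note that Sp as defined on this slide, TP / (TP + FP), is the quantity that other fields call precision or positive predictive value.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of true features that were found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP): fraction of predictions that are correct."""
    return tp / (tp + fp)


def nucleotide_level_accuracy(true_coding, predicted_coding):
    """Return (Sn, Sp) from two sets of coding nucleotide positions."""
    tp = len(true_coding & predicted_coding)
    fn = len(true_coding - predicted_coding)
    fp = len(predicted_coding - true_coding)
    return sensitivity(tp, fn), specificity(tp, fp)


# Hypothetical example: true exon at 0-99, predicted exon at 20-109.
sn, sp = nucleotide_level_accuracy(set(range(0, 100)), set(range(20, 110)))
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

The same functions apply at the exon or gene level if the counts are taken over exactly matching exons or complete gene structures instead of nucleotides.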
36Gene prediction accuracies
- Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
- Exon level: 75% Sn, 68% Sp (lows less than 30%)
- Gene level: 40% Sn, 30% Sp (lows less than 10%)
- Selected readings
- Parra et al. (2003). Comparative Gene Prediction in Human and Mouse. Genome Research 13:108-117.
- Rogic et al. (2001). Evaluation of Gene-Finding Programs in Mammalian Sequences. Genome Research 11:817-832.
- Guigo et al. (2000). An Assessment of Gene Prediction Accuracy in Large DNA Sequences. Genome Research 10:1631-1642.
- Reese et al. (2000). Genome Annotation Assessment in Drosophila melanogaster. Genome Research 10:483-501.
- Burge and Karlin (1997). Prediction of Complete Gene Structures in Human Genomic DNA (Tab. 1). JMB 268:78-94.
37Common difficulties
- First and last exons are difficult to annotate because they contain UTRs.
- Smaller genes do not reach statistical significance, so they are discarded.
- Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
- Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.
38The annotation pipeline
- Mask repeats using RepeatMasker.
- Run sequence through several programs.
- Take predicted genes and do a similarity search against ESTs and genes from other organisms.
- Do a similarity search for non-coding sequences to find ncRNA.
39Annotation nomenclature
- Known Gene: Predicted gene matches the entire length of a known gene.
- Putative Gene: Predicted gene contains a region conserved with a known gene. Also referred to as "like" or "similar to".
- Unknown Gene: Predicted gene matches a gene or EST whose function is not known.
- Hypothetical Gene: Predicted gene that does not contain significant similarity to any known gene or EST.