Course Outline - PowerPoint PPT Presentation

Title: Course Outline


1
Course Outline
2
What Happens When the Human Genome Project (HGP) Is Completed?
  • Known gene number, location, function and
    regulation.
  • Chromosome structure and organization.
  • Non-coding DNA types, amount, distribution and
    function.
  • Gene expression, protein synthesis, and
    post-translational events.
  • Protein interactions and networks.
  • Evolutionary conservation among organisms
    (structure and function).
  • Correlation of SNPs with health and disease.
  • Genes involved in complex traits and multi-gene
    diseases.

3
The Next Step
Locate all the genes and describe their
function. This will probably take another 15-20
years!
4
Reminder: Genome Browsers and Gene Prediction
5
Eukaryotes vs Prokaryotes
Typical human and bacterial cells drawn to
scale.
  • Eukaryotic cells are characterized by
    membrane-bound compartments, which are absent
    in prokaryotes.

BIOS Scientific Publishers Ltd, 1999
6
Gene Prediction is Different for Eukaryotes and
Prokaryotes
  • Prokaryotic genes
  • Eukaryotic genes

http://csbl.bmb.uga.edu/resources/slides/gene-finding-2.ppt#275,4,Review
7
Eukaryotes Splice Signals
8
Simple Prokaryotic Gene Model
  • Simple promoter
  • Small genomes, high gene density
  • Example: the Haemophilus influenzae genome is
    85% genic
  • Operons: one transcript, many genes
  • No introns: one gene, one protein
  • Open reading frames: one ORF per gene; ORFs
    begin with a start codon and end with a stop codon

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
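The "one ORF per gene" model above can be sketched as a simple scan for ATG ... stop in each forward reading frame. This is only an illustration under stated assumptions: the function name and the `min_codons` cutoff are hypothetical, only ATG is treated as a start codon, and only the forward strand is scanned. Real prokaryotic gene finders such as Glimmer or GeneMark rely on statistical models rather than this bare scan.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ORFs on the forward strand.

    An ORF runs from an ATG to the first in-frame stop codon
    (stop included); `end` is exclusive. `min_codons` is an
    illustrative cutoff, not a recommended value.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs

demo = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(demo):
    print(frame, start, end, demo[start:end])
```

To cover the reverse strand as well, the same function could be run on the reverse complement of the sequence.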
9
Gene finding is simple!
http://tardigrada.cap.ed.ac.uk/teaching/genomics/tech2/img2.htm
10
Prokaryotic Gene Prediction Tools
  • Glimmer:
    http://cbcb.umd.edu/software/glimmer/
  • GeneMark:
    http://exon.gatech.edu/genemark/genemark_prok_gms_plus.cgi
  • ORNL Annotation Pipeline:
    http://compbio.ornl.gov/GP3/pro.html
  • FramePlot, a protein-coding region prediction tool
    for high GC-content bacteria:
    http://www.nih.go.jp/jun/cgi-bin/frameplot.pl

11
Eukaryotes
  • Complex gene structure.
  • Large genomes (0.1 to 3 billion bases).
  • Exons and introns (interrupted).
  • Low coding density (<30%).
  • 2-3% in humans, 25% in Fugu, 60% in yeast
  • Alternative splicing (40-60% of all genes).
  • Considerable number of pseudogenes.

12
Non-Protein Coding Gene Tools
  • tRNA
  • tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/)
  • FAStRNA (http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html)
  • snoRNA
  • snoRNA database (http://rna.wustl.edu/snoRNAdb/)
  • microRNA
  • Sfold (http://www.bioinfo.rpi.edu/applications/sfold/index.pl)
  • SIRNA (http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html)

http://harlequin.jax.org/GenomeAnalysis/GeneFinding04.ppt#257,38,Non-protein Coding Gene Tools and Information
13
Approaches to Gene Finding
  • Direct: homology-based gene prediction
  • Exact or near-exact matches (similarity searches)
    of EST, cDNA, or proteins from the same or closely
    related organisms (comparative genomics).
  • Indirect
  • Look for something that looks like one gene
    (homology).
  • Look for something that looks like all genes (ab
    initio).
  • Hybrid, combining homology and ab initio (and
    perhaps even direct) methods.

16
General Things to Remember about
(Protein-Coding) Gene Prediction Software
  • Works best on genes that are reasonably similar
    to something seen previously.
  • Finds protein coding regions far better than
    non-coding regions. First and last exons are
    difficult to annotate because they contain UTRs.
    Small genes are not statistically significant and
    therefore hard to predict.
  • In the absence of external (direct) information,
    alternative forms will not be identified.
  • It is imperfect! (It's biology, after all.)

17
Finding Eukaryotic Genes Computationally
  • Gene finding based on homology evidence: BLAST,
    FASTA, BLAT, etc.
  • Content-based methods: CpG islands, GC content,
    hexamer repeats, composition statistics, codon
    frequencies
  • Feature-based methods: donor sites, acceptor
    sites, promoter sites, start/stop codons, polyA
    signals, feature lengths
  • Similarity-based methods: sequence homology, EST
    searches
  • Pattern-based methods: HMMs, artificial neural
    networks
  • Most effective is a combination of all the above!

18
Content-Based Methods
  • CpG islands (in and near approximately 40% of
    promoters of mammalian genes; about 70% in human
    promoters)
  • Very abundant near gene start site
  • High GC content found in 5' ends of genes
  • Codon Bias
  • Some codons are strongly preferred in coding
    regions, others are not
  • Positional Bias
  • 3rd base tends to be G/C rich in coding regions
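The content measures listed above are easy to compute directly. The sketch below shows GC content, the CpG observed/expected ratio, and third-codon-position GC content (GC3). Function names are illustrative. The observed/expected formula, CpG count x length / (C count x G count), is the one commonly used for CpG island detection; thresholds such as GC above 50% and ratio above 0.6 over at least 200 bp are a widely quoted convention, not something stated on the slide.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def gc3(cds):
    """GC content at the third position of each codon.

    Assumes `cds` starts in frame at a codon boundary.
    """
    return gc_content(cds.upper()[2::3])

window = "ATGCGCGGCCGCGATTAACGCG"
print(round(gc_content(window), 2), round(cpg_obs_exp(window), 2))
```

In practice these statistics would be computed in a sliding window along the genomic sequence rather than on one short string.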

19
Feature-Based Methods
  • Based on identifying gene signals (promoter
    elements, splice sites, start/stop codons, polyA
    sites, etc.)
  • Wide range of methods
  • Consensus sequences
  • Weight matrices
  • Neural networks
  • Decision trees
  • Hidden Markov Models (HMMs)

20
Eukaryotic Genomic Hints that a Gene is Nearby
  • RNA Pol II promoter elements: GC box, TATA box,
    CCAAT region.
  • Kozak consensus sequence (eukaryotic ribosome
    binding site, RBS): (gcc)gccRccAUGG,
    where R is a purine (adenine or guanine) three
    bases upstream of the start codon (AUG).
  • Splicing signals: donor, acceptor.
  • Termination signal.
  • Polyadenylation signal.
  • Promoter elements.

21
Pol II Promoter Elements
  • 5' cap region/signal: nCAGTnG
    (the cap is a modified guanine nucleotide added to
    the "front", or 5' end, of a eukaryotic messenger
    RNA shortly after the start of transcription; it
    is important for ribosome recognition).
  • TATA box (30 bp upstream): TATAAA
  • CCAAT box (80 bp upstream): TAGCCAATG
  • GC box (200 bp upstream): GGGCGG
  • None of these are essential for gene expression
  • Each of these may have more than one copy, except
    the TATA box, to produce greater effect
  • There are other promoter elements
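A consensus scan is the simplest feature-based method for the elements above. The sketch below reports every exact occurrence of each motif. Assumptions: the dictionary and function names are illustrative, and the CCAAT box is searched by its core pentamer CCAAT rather than the longer TAGCCAATG given on the slide. Exact matching produces many false positives and misses diverged sites, which is why real promoter predictors use weight matrices or HMMs.

```python
import re

PROMOTER_MOTIFS = {
    "TATA box": "TATAAA",
    "CCAAT box": "CCAAT",   # core of the slide's TAGCCAATG (assumption)
    "GC box": "GGGCGG",
}

def scan_motifs(seq, motifs=PROMOTER_MOTIFS):
    """Return sorted (position, name) pairs for exact motif matches."""
    seq = seq.upper()
    hits = []
    for name, motif in motifs.items():
        # The lookahead lets overlapping occurrences be reported too.
        for m in re.finditer(f"(?={motif})", seq):
            hits.append((m.start(), name))
    return sorted(hits)

upstream = "GGGCGGAACCAATTTTATAAAGG"
for pos, name in scan_motifs(upstream):
    print(pos, name)
```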

22
Control of Gene Expression: Promoter Prediction
23
How Does It Work? Motif Identification
  • Exon-Intron Borders: Splice Sites

Exon        Intron                Exon
gaggcatcag  GTttgtagactgtgtttcAG  tgcacccact
ccgccgctga  GTgagccgtgtctattctAG  gacgcgcggg
tgtgaattag  GTaagaggttatatctccAG  atggagatca
ccatgaggag  GTgagtgccattatttccAG  gtatgagacg
Splice sites are the two exon-intron boundaries (GT ... AG).
Motif extraction programs: http://www-btls.jst.go.jp/
Tools for genome analysis:
http://www-btls.jst.go.jp/cgi-bin/Tools/index.cgi?lang=en
http://www.seas.gwu.edu/simhaweb/cs177/spring2004/zeeberglecture1Part2.ppt#317,25,How it works I Motif identification
24
Prediction of Splice Junction Sites
Splice site consensus sequences:
3' splice site: CAG/GT
5' splice site: MAG/GTRAGT (M is A or C; R is A or G)
Splice site prediction tools:
http://l25.itba.mi.cnr.it/webgene/wwwspliceview.html
http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
Splice predictor:
http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
Splice site prediction by Neural Network:
http://www.fruitfly.org/seq_tools/splice.html
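The 5' splice site consensus MAG/GTRAGT uses IUPAC ambiguity codes, which translate directly into a regular expression. The sketch below is a minimal consensus matcher; the function names and the `exon_len` parameter (the number of consensus bases that lie in the exon, here 3) are illustrative assumptions. Many real donor sites deviate from the consensus, so a strict match like this has low sensitivity.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Convert an IUPAC consensus string to a regex pattern."""
    return "".join(IUPAC[base] for base in consensus.upper())

def find_donor_sites(seq, consensus="MAGGTRAGT", exon_len=3):
    """Return 0-based positions where the intron (GT...) would start."""
    pattern = re.compile(f"(?={consensus_to_regex(consensus)})")
    return [m.start() + exon_len for m in pattern.finditer(seq.upper())]

print(consensus_to_regex("MAGGTRAGT"))
print(find_donor_sites("TTCAGGTAAGTCC"))
# A real boundary that deviates from the consensus is missed:
print(find_donor_sites("ccatgaggagGTgagtgcc"))
```

The last call returns an empty list even though the sequence contains an annotated donor site, which illustrates why weight matrices and neural networks are preferred over strict consensus matching.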
25
Prediction of Translation Start Site
  • Translation start: ATG
  • How to predict a translation start?

ATG
GCCATGGCGA .. ACGATGCTGT . GACATGGTAC
AGGATGGGCT GCGATGTGGC
AUG codon finder tools:
http://www.cbs.dtu.dk/services/NetStart/
http://l25.itba.mi.cnr.it/webgene/wwwaug.html
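One simple way to rank candidate ATGs is to check the two most informative Kozak positions: a purine at -3 and a G at +4. The sketch below applies that check to the example sequences on this slide. The three-level labels (weak, adequate, strong) follow a common convention and are an assumption here, as is the function name; tools such as NetStart use trained models instead.

```python
def kozak_strength(seq, atg_pos):
    """Classify the context of the ATG whose A is at index `atg_pos`.

    Scores one point for a purine (A or G) at -3 and one point for
    G at +4 (the base right after ATG).
    """
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    minus3 = seq[atg_pos - 3] if atg_pos >= 3 else ""
    plus4 = seq[atg_pos + 3] if atg_pos + 3 < len(seq) else ""
    score = (minus3 in ("A", "G")) + (plus4 == "G")
    return ["weak", "adequate", "strong"][score]

for s in ["GCCATGGCGA", "ACGATGCTGT", "GACATGGTAC", "AGGATGGGCT", "GCGATGTGGC"]:
    print(s, kozak_strength(s, 3))
```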
26
Polyadenylation and Termination Signals
  • Polyadenylation signal
  • A(A/T)TAAA, i.e. AATAAA or ATTAAA
  • Located 20 bp upstream of poly-A cleavage site
  • Termination signal
  • AGTGTTCA
  • Located 30 bp downstream of poly-A cleavage site

27
Why Polyadenylation is Really Useful
Complementary Base Pairing
AAAAAAAAAAA TTTTTTTTTTT
28
Links for Gene Finding Software
Poly-A signal prediction (the poly-A tail, 60-200
residues, is added post-transcriptionally to the
3' end of most eukaryotic mRNAs):
http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
http://l25.itba.mi.cnr.it/genebin/wwwHC_POLYA
http://genomic.sanger.ac.uk/gf/gf.shtml
http://rulai.cshl.org/tools/polyadq/polyadq_form.html
Repeated elements (RepeatMasker):
http://ftp.genome.washington.edu/RM/RepeatMasker.html
http://l25.itba.mi.cnr.it/genebin/wwwrepeat.pl
GC-rich areas:
http://rulai.cshl.org/tools/CpG_promoter/
29
The Annotation Pipeline
  • Mask repeats using RepeatMasker
    (http://www.repeatmasker.org/).
  • Run sequence through several gene prediction
    programs.
  • Validation
  • Take predicted genes and do similarity search
    against ESTs and genes from other organisms.
  • Do similarity search for non-coding sequences to
    find ncRNA.

30
Eukaryotic Gene Prediction Tools
  • Genscan (ab initio): http://genes.mit.edu/
  • GenomeScan (hybrid): http://genes.mit.edu/genomescan.html
  • Twinscan (hybrid): http://genes.cs.wustl.edu/
  • FGENESH (ab initio): http://www.softberry.com/berry.phtml?topic=gfind
  • GeneMark.hmm (ab initio): http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi
  • MZEF (ab initio): http://rulai.cshl.org/tools/genefinder/
  • GrailEXP (hybrid): http://grail.lsd.ornl.gov/grailexp/
  • GeneID (hybrid): http://www1.imim.es/software/geneid/geneid.html
  • FirstEF: http://rulai.cshl.org/tools/FirstEF/

31
Gene Finding Software: Promoter Prediction
McPromoter:
http://genes.mit.edu/McPromoter.html
Human promoter prediction:
http://www.softberry.com/berry.phtml?topic=fprom&group=programs&subgroup=promoter
DNA motif search (in TRANSFAC):
http://motif.genome.jp/
TFBind:
http://tfbind.ims.u-tokyo.ac.jp/
CONSITE (predicts TFBS for 1 or 2 sequences):
http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite
Promoter prediction, i.e. the DNA region that RNA
polymerase binds before initiating transcription
(TATA box prediction):
http://www.fruitfly.org/seq_tools/promoter.html
Transcription factor binding site prediction:
http://www.cbil.upenn.edu/tess/index.html
32
Example: gene finding tool
http://genes.mit.edu/genomescan.html
33
After selecting an ORF, you can BLAST its amino
acid sequence to identify the protein.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
6 frames
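The "6 frames" an ORF finder displays are the three reading frames of the sequence plus the three of its reverse complement. The sketch below translates all six with the standard genetic code. It is not how NCBI ORF Finder is implemented; names are illustrative, and any codon containing a non-ACGT character is shown as X.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate from the first base; '*' marks stop codons."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return a dict of the six frame translations, keyed +1..+3, -1..-3."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for i in range(3):
        frames[f"+{i + 1}"] = translate(seq[i:])
        frames[f"-{i + 1}"] = translate(rev[i:])
    return frames

for name, protein in six_frames("ATGGCCTAAGGC").items():
    print(name, protein)
```

The longest stop-free stretch starting with M in any frame is the usual candidate to submit to a protein BLAST search.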
34
GENSCAN output for sequence
Optimal exon
Initial exon
Internal exon
Terminal exon
Single Exon gene
Sub-Optimal exon
http://genes.mit.edu/
35
Gene Prediction Pipeline
  • Get a genomic sequence from a genome browser
  • Example: BRCA1 genomic sequence NG_005905, in
    FASTA format.
  • Apply the gene prediction tools provided
  • GENSCAN: http://genes.mit.edu/GENSCAN.html
  • ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  • Also worth trying
  • GeneBuilder: http://l25.itba.mi.cnr.it/webgene/genebuilder.html
  • Compare results.
  • Try the CONSITE example for transcription factor
    recognition
  • (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)

Step-by-step practical exercise for gene
annotation:
http://www1.imim.es/eblanco/seminars/docs/Reus2005/exercise1/index.html
36
Annotation of Putative Genes
Annotation category
  • Matches known protein sequence
  • Strong similarity to protein sequence
  • Similar to known protein
  • Similar to unknown protein
  • Similar to EST (i.e., putative protein)
  • No EST or protein matches (i.e., hypothetical
    protein)

37
Sequencing and Assembling a Genome
  • To sequence a genome, the first task is to cut it
    into many small, overlapping pieces.
  • Then clone each piece.

http://media.wiley.com/assets/1312/61/chapter05.ppt#317,15,Sequencing and Assembling a Genome (I)
38
Sequencing and Assembling a Genome
  • Each piece must be sequenced.
  • Sequencing machines cannot do an entire sequence
    at once.
  • They can only produce short sequences smaller
    than 1 Kb.
  • These pieces are called reads.
  • It is necessary to assemble the reads into
    contigs.

http://media.wiley.com/assets/1312/61/chapter05.ppt#322,16,Sequencing and Assembling a Genome (II)
39
Cleaning DNA Sequences
  • In order to sequence genomes, DNA sequences are
    often cloned in a vector (plasmid, YAC, or
    cosmid).
  • Sequences of the vector can be mixed with your
    DNA sequence.
  • Before working with your DNA sequence, you should
    always clean it with VecScreen.

http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
40
Gene Assembly
Usually genes are found in fragments and need to
be assembled. Simple example:
Input: ACCGT CGTGC TTAC TACCGT
Output:
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
_________
TTACCGTGC
Spaces are ignored. Fragments are
assembled according to overlapping areas.
A multiple alignment of the fragments results in
a consensus sequence.
Assembled piece: 9 bases
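The overlap idea in this example can be sketched as a greedy merge: repeatedly join the two fragments with the longest exact suffix-prefix overlap. This is a toy under stated assumptions (error-free reads, a single strand, exact overlaps, hypothetical function names); assemblers such as CAP3 use alignment-based overlaps that tolerate sequencing errors.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Merge fragments greedily; returns the list of resulting contigs."""
    frags = list(dict.fromkeys(fragments))          # drop exact duplicates
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break                                   # no overlaps left
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags

print(greedy_assemble(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))
```

On the slide's input this yields the single 9-base contig TTACCGTGC. Greedy merging can go wrong on repeats, which is one reason real assembly is more complicated.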
41
Sequence Assembly - Real life is more complicated
Errors in sequencing. Example:
Input: ACCGT CGTGC TTAC TGCCGT
Output:
--ACCGT--
----CGTGC
TTAC-----
-TGCCGT--
TTACCGTGC
The consensus is still correct because of
majority voting.
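Majority voting can be written as a per-column count over the aligned fragments. In this sketch gap characters do not vote and ties are reported as "?". Both choices are simplifying assumptions, and the function name is illustrative.

```python
from collections import Counter

def consensus(aligned_rows, gap="-"):
    """Column-wise majority vote over equal-layout aligned rows."""
    width = max(len(r) for r in aligned_rows)
    out = []
    for col in range(width):
        column = [r[col] for r in aligned_rows
                  if col < len(r) and r[col] != gap]
        if not column:
            out.append(gap)                 # nothing covers this column
            continue
        counts = Counter(column).most_common()
        tie = len(counts) > 1 and counts[0][1] == counts[1][1]
        out.append("?" if tie else counts[0][0])
    return "".join(out)

rows = ["--ACCGT--",
        "----CGTGC",
        "TTAC-----",
        "-TGCCGT--"]
print(consensus(rows))
```

In the third column the erroneous G is outvoted two to one by A, so the consensus is unchanged.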
Insertion error. Example:
Input: ACCGT CAGTGC TTAC TACCGT
Output:
--ACC-GT--
----CAGTGC
TTAC------
-TACC-GT--
TTACC?GTGC
42
Gene Assembly - Real life is more complicated
Deletion error. Example:
Input: ACCGT CGTGC TTAC TACGT
Output:
--ACCGT--
----CGTGC
TTAC-----
-TAC_GT--
TTACCGTGC
Repeats
Lack of coverage: part of the target DNA is an
uncovered area.
43
Sequence Assembly
CAP3:
http://pbil.univ-lyon1.fr/cap3.php
http://bioweb.pasteur.fr/seqanal/interfaces/cap3.html
http://deepc2.psi.iastate.edu/aat/cap/cap.html
Example sequences (seq):
http://genome.cs.mtu.edu/cap/data/
44
Compute a Restriction Map
NEBcutter V2.0:
http://tools.neb.com/NEBcutter2/index.php
http://bioinformatics.org/sms2/rest_map.html
WatCut, an on-line tool for restriction
analysis, silent mutation scanning, and SNP-RFLP
analysis:
http://watcut.uwaterloo.ca/watcut/watcut/template.php
Sequence Extractor generates a clickable
restriction map and PCR primer map of a DNA
sequence. Protein translations and intron/exon
boundaries are also shown:
http://bioinformatics.org/seqext/
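At its core a restriction map is a list of positions where recognition sites occur. The sketch below locates sites for three well-known enzymes (EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT) on a linear sequence and derives the fragment lengths. The function names and the small enzyme table are illustrative; the tools listed above draw on far larger enzyme databases.

```python
# Recognition site and cut offset on the top strand (cut after `offset` bases).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}

def restriction_map(seq, enzymes=ENZYMES):
    """Return {enzyme: [cut positions]} for a linear sequence."""
    seq = seq.upper()
    cuts = {}
    for name, (site, offset) in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + offset)
            start = seq.find(site, start + 1)
        cuts[name] = positions
    return cuts

def fragment_lengths(seq_len, cut_positions):
    """Fragment sizes after cutting a linear molecule at the positions."""
    points = [0] + sorted(cut_positions) + [seq_len]
    return [b - a for a, b in zip(points, points[1:])]

dna = "AAGAATTCTTGGATCCAA"
cuts = restriction_map(dna)
print(cuts)
print(fragment_lengths(len(dna), cuts["EcoRI"] + cuts["BamHI"]))
```

Because all three sites are palindromic, scanning the top strand alone is enough for these enzymes.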
45
Perform PCR Using a Computer
  • Polymerase Chain Reaction (PCR) is a method for
    amplifying DNA.
  • PCR is used for many applications, including:
  • Gene cloning
  • Forensic analysis and paternity tests
  • PCR amplifies the DNA between two anchors called
    PCR primers.
  • Hydrogen bonding of single-stranded nucleic acids
    is referred to as "annealing": two complementary
    sequences will form hydrogen bonds between their
    complementary bases (G to C, and A to T or U) and
    form a stable double-stranded molecule.

primers
http://media.wiley.com/assets/1312/61/chapter05.ppt#309,6,Making PCR with a Computer
46
Annealing Temperature and Primer Design
The melting temperature (Tm) of a DNA duplex
increases both with its length and with
increasing (G+C) content, and may be calculated as
Tm = 4(G + C) + 2(A + T) °C.
Degenerate primers are used for amplification of
sequences from different organisms: a set of
primers which have a number of options at several
positions in the sequence, so as to allow
annealing to and amplification of a variety of
related sequences.
http://www.mcb.uct.ac.za/pcroptim.htm
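The Tm formula on this slide, often called the Wallace rule, is simple to compute: 4 °C for every G or C and 2 °C for every A or T. The sketch below applies it; the function name is illustrative. The rule is a rough estimate intended for short oligonucleotides, and primer design programs generally use nearest-neighbor thermodynamics instead.

```python
def wallace_tm(primer):
    """Estimate Tm in degrees C as 4*(G+C) + 2*(A+T)."""
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    at = p.count("A") + p.count("T")
    return 4 * gc + 2 * at

for primer in ["ATGCATGCATGCATGCAT", "GGCCGGCCGGCCGGCCGG"]:
    print(primer, wallace_tm(primer))
```

Both primers are 18 bases long, but the GC-only one has a much higher estimated Tm, which matches the statement that Tm rises with GC content.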
47
PCR Primer Design Rules
1. Primers should be 17-28 bases in length.
2. Base composition determines DNA stability
   (50-60% GC content; higher GC makes more stable
   DNA).
3. Primers should end (3') in a G or C, or CG or
   GC; this prevents "breathing" of ends and
   increases efficiency of priming.
4. Primer Tms (melting temperatures) between
   55-80 °C are preferred.
5. 3'-ends of primers should not be complementary
   (i.e. base pair), as otherwise primer dimers will
   be synthesized preferentially to any other
   product.
(Adapted from Innis and Gelfand, 1991:
http://bioweb.uwlax.edu/GenWeb/Molecular/Seq_Anal/Primer_Design/primer_design.htm)
48
Primer Design
6. Primer self-complementarity (ability to form
   secondary structures such as hairpins) should be
   avoided.
7. Runs of three or more Cs or Gs at the 3'-ends
   of primers may promote mis-priming at G- or
   C-rich sequences (because of stability of
   annealing), and should be avoided.
8. Target sequence: amplicon length.
9. Check cross homology with related genes and
   possible pseudogenes. Amplicon location
   (distance from 3' end).
10. Intron spanning.
Read about primer design:
http://www.mcb.uct.ac.za/manual/pcroptim.htm
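Several of the rules above are mechanical and can be checked in code. The sketch below tests rules 1-4 and 7 for a single primer, using the 4(G+C) + 2(A+T) estimate for Tm. The function name and the returned labels are illustrative, the primer is assumed to contain only A, C, G and T, and the complementarity rules (5 and 6) and target-specific rules (8-10) are not covered.

```python
import re

def check_primer(primer):
    """Return a dict of rule name -> True/False for one primer."""
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    gc_pct = 100 * gc / len(p)
    tm = 4 * gc + 2 * (len(p) - gc)      # assumes only A/C/G/T present
    return {
        "length 17-28": 17 <= len(p) <= 28,
        "GC 50-60%": 50 <= gc_pct <= 60,
        "3' end is G or C": p[-1] in "GC",
        "Tm 55-80 C": 55 <= tm <= 80,
        "no run of 3+ G/C at 3' end": re.search(r"[GC]{3,}$", p) is None,
    }

for rule, ok in check_primer("AGTCAGTCAGTCAGTCAGTC").items():
    print("PASS" if ok else "FAIL", rule)
```

A primer pair would additionally need a check that the two 3' ends are not complementary to each other.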
49
Primer Design Tools
Primer3:
http://biotools.umassmed.edu/bioapps/primer3_www.cgi
PrimerQuest (also option for RT-PCR):
http://www.idtdna.com/Scitools/Applications/PrimerQuest
PCR primers based upon multialignments:
http://cgi-www.daimi.au.dk/cgi-chili/PriFi/main?config.x=101&config.y=30
PCR primers designed from protein multiple
sequence alignments:
http://bioinformatics.weizmann.ac.il/blocks/codehop.html
GeneFisher: input single or multiple sequence(s),
either nucleotide or amino acid. For multiple
sequences a multiple alignment will be calculated
(or upload already aligned sequences). Example
sequences are on the web site:
http://bibiserv.techfak.uni-bielefeld.de/genefisher2/submission.html
Web Primer:
http://seq.yeastgenome.org/cgi-bin/web-primer
50
Special Features Primer Design
Create overlapping PCR products in large
sequences:
http://www2.eur.nl/fgg/kgen/primer/Overlapping_Primers.html
Creating primers around exons in genomic DNA:
http://www2.eur.nl/fgg/kgen/primer/Genomic_Primers.html
Creating primers around SNPs in genomic DNA:
http://www2.eur.nl/fgg/kgen/primer/SNP_Primers.html
Creating primers around the Open Reading Frame of
cDNAs:
http://www2.eur.nl/fgg/kgen/primer/cDNA_Primers.html
51
Examine Primer Properties
NetPrimer:
http://molbiol-tools.ca/PCR.htm
Blast primer sequence against NCBI and provide
primer properties:
http://www.idtdna.com/analyzer/Applications/OligoAnalyzer/
OligoCalculator (oligonucleotide properties
calculator):
http://www.basic.northwestern.edu/biotools/oligocalc.html
OligoAnalyzer:
http://www.idtdna.com/analyzer/Applications/OligoAnalyzer/
52
Other Primer Utilities
UCSC In-Silico PCR:
http://genome.ucsc.edu/cgi-bin/hgPcr?command=start
Electronic PCR (e-PCR) is a computational
procedure used to identify sequence tagged sites
(STSs) within DNA sequences:
http://www.ncbi.nlm.nih.gov/sutils/e-pcr/
In silico simulation of molecular biology
experiments:
http://insilico.ehu.es/
http://insilico.ehu.es/PCR/
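The principle behind in silico PCR can be sketched in a few lines: find the forward primer on the template, find the reverse complement of the reverse primer downstream of it, and report the sequence in between, primers included. This sketch assumes exact primer matches on a single linear template and uses hypothetical function names; it is not how the UCSC or NCBI tools are implemented, since those search whole genomes and tolerate mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def in_silico_pcr(template, forward, reverse):
    """Return the predicted amplicon, or None if a primer has no exact match.

    Both primers are given 5'->3'; the reverse primer anneals to the
    top strand, so its reverse complement is searched in the template.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    rev_site = reverse_complement(reverse)
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]

template = "TTTACGTACGGGGGGCCATTGCAAA"
product = in_silico_pcr(template, "ACGTACG", "GCAATGG")
print(product, len(product))
```

Running the same primers against a second template (for example mouse versus human sequence) and getting None is the computational analogue of a species-specific primer pair.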
53
Primers Design Pipeline
  • Use only accurate sequence data!
  • Restrict primer search to regions that best
    reflect goals
  • Locate candidate primers
  • Discard candidate primers that show
    undesirable self-hybridization
  • Verify the site-specificity of the primer

54
Primers Design Pipeline
  • Question
  • Hybrid cells are formed containing human BRCA1
    cDNA in mouse cells.
  • We want to identify cells which have incorporated
    the human cDNA, using PCR primers.
  • Steps
  • Look for mouse and human BRCA1 in the NCBI Gene
    database.
  • Locate the BRCA1 cDNA sequence (is there only one
    transcript?).
  • If there is more than one transcript, compare
    them (let's take only 3 for this example)
    and find the region that would identify all
    transcripts (hint: use
    http://www.ebi.ac.uk/Tools/clustalw/index.html).
  • Find primer pairs for PCR
    (http://frodo.wi.mit.edu/primer3/input.htm).
  • Check primer specificity (human only, not
    mouse)
  • UCSC: http://genome.ucsc.edu/cgi-bin/hgPcr?command=start
  • NCBI (BLAST): http://genome.ucsc.edu/cgi-bin/hgPcr?command=start
  • Check primer properties
    (http://www.basic.northwestern.edu/biotools/oligocalc.html)

55
Bioinformatics - Past and Present
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
SEQUENCES LITERATURE
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
56
Bioinformatics - Present and Future
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
GENE EXPRESSION, GENES FUNCTION, DRUG PERSONAL
THERAPY
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
57
Are We Done ?
"Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the
end of the beginning." Winston Churchill, 1942
(3 years into WW2)
http://www.globecartoon.com/neweconomy/10.html
58
Thank-you !
Dr. Metsada Pasmanik-Chor, Bioinformatics Unit,
001 Sherman Bldg., Faculty of Life Science,
TAU. Tel: x 6992. E-mail: metsada_at_bioinfo.tau.ac.il
Bioinfo. Unit webpage: http://www.tau.ac.il/lifesci/bioinfo/