Human Genome Project - PowerPoint PPT Presentation

Slides: 35
Provided by: michaelsr

1
Human Genome Project
  • In early 2001, it was announced that the entire
    human genome had been sequenced
  • Let's pretend this was true
  • It wasn't really true; it was more of a publicity
    stunt
  • It was a very rough draft
  • Even today, 5 years later, it is not quite done
    although it is much closer

2
Human Genome Project
  • What does the sequencing of the human genome
    mean?
  • We now have 24 very long strings of letters (only
    A, T, C, G) representing the human genome
  • These strings range from 55Mb to 220Mb, for a
    total of 3.2Gb
  • How big is this?
  • If the entire human genome were printed in the
    same font used by most encyclopedias, it would be
    10 times longer than the 2002 edition of
    Encyclopaedia Britannica
  • No information on variability or frequency
  • Represents a single snapshot of the genome
  • What good is it?

3
Genomics
  • The general process of finding things within a
    sequence is known as Genome Annotation
  • How can we identify coding genes in a sequence?

4
Finding Coding Genes
  • Experimentally
  • How does one find a gene?
  • DNA → mRNA → Amino Acid/Protein
  • Grab the mRNAs
  • Use reverse transcription to create cDNA
  • Look for the cDNA in the genome; this tells you
    where the gene is
  • Use BLAST or a similar tool

5
Finding Coding Genes
  • Complication: Introns
  • In the mRNA the introns have been removed, thus
    our cDNA also does not contain introns
  • We need to find pieces of contiguous cDNA but
    should not expect to find the entire cDNA
    sequence as one large chunk
  • What if we do find the entire cDNA?
  • mRNA will occasionally be reverse transcribed and
    placed back in the genome, often at random points
  • This is known as a Processed Pseudogene
  • It's a pseudogene because while the complete
    coding sequence is present, there are usually no
    control regions
  • In Eukaryotes, almost all intronless genes turn
    out to be pseudogenes

6
Finding Coding Genes
  • We can find genes experimentally; does this help
    us find them computationally?
  • What we really want is to find genes in a
    sequence de novo, not experimentally
  • How might one do this?
  • What would you look for?
  • Known genes from other organisms?
  • This would still require that the gene be
    identified experimentally in the other organism
  • Does not help identify unique genes
  • Start with a collection of genes already
    identified through experiments
  • Identify commonalities

7
Finding Coding Genes
  • Given a set of genes, find out what is common
    among them
  • Search the full genome for these common features
  • Include up- and downstream information, not just
    the coding sequence
  • What are the common features of genes? What
    should we be looking for?

8
Prokaryotic Genes
  • Prokaryotes are bacteria and similar unicellular
    organisms
  • Prokaryotic genomes are simpler than eukaryotic
    genomes
  • Smaller genomes
  • High gene density, approximately 1 gene/kb on
    average
  • No introns
  • Little repetitive DNA

9
Prokaryotic Genes
  • Open Reading Frame (ORF)
  • A stretch of DNA bracketed by a start codon (ATG)
    and a stop codon (TGA, TAA, or TAG)
  • The start codon can be tricky since ATG could be
    the beginning or the middle of a sequence (with
    another ATG upstream). But, the fact that
    another ATG is upstream does not automatically
    mean the second one is not the proper start codon
  • Any stretch of DNA has six potential reading
    frames, so we have to examine all six for ORFs
  • Are all ORFs coding genes?
  • No, but all coding genes must contain an ORF
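The six-frame search described on this slide can be sketched in a few lines of Python. This is a toy illustration, not a production gene finder: the function names and the `min_codons` parameter are assumptions made for the example, and it always takes the most upstream ATG as the start, which (as noted above) is not guaranteed to be the true start codon.

```python
# Toy six-frame ORF scan. All names here are illustrative assumptions.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) for each ATG...stop stretch in one reading frame.

    `end` is exclusive and includes the stop codon. The first ATG seen
    after the previous stop is used as the start.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=1):
    """Return (strand, frame, orf_sequence) for all six reading frames."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, s[start:end]))
    return results
```

Raising `min_codons` filters out short ORFs, which are the ones most likely to occur by chance.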

10
Open Reading Frames
  • Once we've found an ORF, how can we tell if it
    might be a gene?
  • Length
  • How many codons are there?
  • 64
  • How many stop codons are there?
  • 3
  • What is the probability of any given codon being
    a stop codon?
  • Approximately 1/21
  • Thus for any random stretch of codons, on average
    one would expect a stop codon every 21 codons

11
Open Reading Frames
  • Most prokaryotic genes are longer than 60 codons
  • In E. coli the average gene is 316 codons long
  • Statistically, if all codons appear with equal
    frequency in a random sequence, the probability
    that a stop codon will not occur over N codons is
    (61/64)^N
  • The ORF length at which we expect random sequence
    to have less than a 5% chance of not containing a
    stop codon happens to be about N = 60
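The arithmetic behind these numbers can be checked directly. The sketch below assumes, as the slide does, that all 64 codons are equally likely in random sequence; the variable names are illustrative. Solving (61/64)^N < 0.05 exactly gives N = 63, which agrees with the rounded figure of about 60.

```python
import math

# Assumption from the slide: all 64 codons are equally likely.
p_stop = 3 / 64        # three of the 64 codons are stop codons
p_no_stop = 61 / 64    # chance that one random codon is not a stop


def prob_no_stop(n_codons):
    """Probability that n_codons random codons contain no stop codon."""
    return p_no_stop ** n_codons


# Expected spacing between stop codons in random sequence (about 21 codons).
mean_spacing = 1 / p_stop

# Smallest N for which an open stretch of N codons has < 5% probability.
threshold_n = math.ceil(math.log(0.05) / math.log(p_no_stop))
```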

12
Open Reading Frames
  • Other features of ORFs: Codon usage
  • When we examine a lot of known genes, we find
    that some codons are quite common while others
    are rare
  • Some amino acids are more common than others
  • Even for the same amino acid, not all synonymous
    codons will occur with equal frequency
  • Known as Codon Usage Bias
  • This is true even when nucleotide biases are
    taken into account
  • It probably has something to do with the
    efficiency of gene expression

13
Codon Usage Bias
  • Codon usage biases tend to be species specific

[Figure: percentage of use of serine codons in different species]
  • One can calculate the proportion of codon usage
    for all known genes
  • If a proposed ORF has similar proportions of
    codons as known genes, it is probably a real gene
  • If it tends to have high proportions of rarely
    used codons, it may not be a real gene
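One simple way to turn this comparison into a number is sketched below. It is only one possible scoring scheme, chosen for illustration: it builds a codon frequency table from known genes and scores a candidate ORF by the average log-frequency of its codons. The function names and the pseudocount are assumptions, not part of the original slides.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def split_codons(seq):
    """Split a sequence into complete in-frame codons, skipping non-ACGT ones."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return [c for c in codons if set(c) <= set("ACGT")]


def codon_frequencies(known_genes, pseudocount=1.0):
    """Codon proportions over known genes (pseudocount avoids zero frequencies)."""
    counts = Counter()
    for gene in known_genes:
        counts.update(split_codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def mean_log_frequency(orf, freqs):
    """Average log-frequency of the ORF's codons; higher means more gene-like."""
    codons = split_codons(orf)
    if not codons:
        raise ValueError("ORF contains no complete codons")
    return sum(math.log(freqs[c]) for c in codons) / len(codons)
```

An ORF built from rarely used codons scores lower than one built from common codons, which matches the rule of thumb on this slide.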

14
Promoter Elements
  • Promoters tell the transcription machinery (e.g.,
    RNA polymerase) where to start
  • Are there common promoters?
  • In E. coli the following consensus sequences are
    often found upstream of the start codon
  • The better the -35 and -10 upstream regions match
    a pair of these sequences, the higher the odds
    that it is a real gene

15
Promoter Elements
  • Not every gene in prokaryotes has a promoter!
  • Genes in prokaryotes often work in sets called
    Operons
  • An operon consists of a single promoter followed
    by multiple coding sequences
  • All of the coding sequences are transcribed
    simultaneously when the single promoter is
    activated
  • Genes in operons usually work together as parts
    of a single system

16
Operons
  • Operon Example: Lactose Operon
  • Three Genes
  • Beta-galactosidase
  • Lactose permease
  • Thiogalactoside transacetylase
  • Involved in the metabolism of lactose sugar
  • Transcription creates one long mRNA, pieces of
    which are translated into three separate proteins

17
Operons
  • How can you identify separate genes within one
    operon?
  • There needs to be a sequence in front of the
    start codon for the ribosome to attach to
    (something other than a promoter)
  • Known as Shine-Dalgarno sequences
  • AGGAGGU is found in front of most true start
    codons in mRNA
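A rough way to use this signal when deciding which ATG is a real start is to look for the best match to the motif just upstream. The sketch below works on the DNA form of the motif (AGGAGGT); the 20-nucleotide search window and the function name are assumptions made for the example, not values from the slides.

```python
SD_DNA = "AGGAGGT"  # DNA form of the mRNA motif AGGAGGU


def best_sd_match(dna, atg_pos, window=20):
    """Best ungapped match to SD_DNA in the `window` bases upstream of atg_pos.

    Returns (number_of_matching_bases, position), or (0, None) if the
    upstream region is too short. `window` is an illustrative assumption.
    """
    dna = dna.upper()
    lo = max(0, atg_pos - window)
    best = (0, None)
    for i in range(lo, atg_pos - len(SD_DNA) + 1):
        candidate = dna[i:i + len(SD_DNA)]
        score = sum(a == b for a, b in zip(candidate, SD_DNA))
        if score > best[0]:
            best = (score, i)
    return best
```

A start codon with a 6- or 7-base match upstream is a better candidate than one with no recognizable motif.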

18
Operons
  • Most operons have termination signals after the
    stop codons
  • Inverted repeats, e.g., CGGATGCATCCG
  • This is identical on the reverse strand
  • A stretch of 6 U's follows the repeat
  • These are found in > 90% of prokaryotic operons
  • These inverted repeats are 7 to 20 nucleotides
    long and usually very heavy in G and C
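The "identical on the reverse strand" property is easy to verify in code: a sequence is an inverted repeat of this kind when it equals its own reverse complement. The sketch below also checks for the run of T (U in the mRNA) after the repeat. It assumes the run follows the repeat immediately, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_inverted_repeat(seq):
    """True if the sequence reads the same on the reverse strand."""
    return seq.upper() == reverse_complement(seq)


def looks_like_terminator(dna, start, length, t_run=6):
    """True if dna[start:start+length] is an inverted repeat directly
    followed by t_run T's (assumption: no gap between repeat and run)."""
    stem = dna[start:start + length]
    tail = dna[start + length:start + length + t_run].upper()
    return is_inverted_repeat(stem) and tail == "T" * t_run
```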

19
Prokaryotic Genes
  • Complications
  • Lateral gene transfer (= horizontal gene
    transfer)
  • When a species picks up a gene from another
    species rather than a direct ancestor
  • Because different species have different GC
    contents and different codon usage patterns,
    lateral gene transfers will often appear very
    different from the other genes within the species
  • This makes them easier to identify as foreign
    genes, if you can identify them as genes in the
    first place!
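As a minimal sketch of how the GC-content difference might be used, the code below flags genes whose GC content is far from the genome-wide average. The z-score cutoff of 2 and the function names are assumptions for illustration; real analyses typically combine GC content with codon usage.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_genes(genes, z_cutoff=2.0):
    """genes: dict mapping gene name -> sequence.

    Returns the sorted names of genes whose GC content lies more than
    z_cutoff standard deviations from the mean over all genes.
    """
    gcs = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gcs.values())
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return []
    return sorted(n for n, g in gcs.items() if abs(g - mu) / sd > z_cutoff)
```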

20
Eukaryotic Genes
  • Eukaryotes are much more complicated than
    prokaryotes
  • Introns
  • Break up the reading frame and make sequences
    much longer
  • Tens of thousands of genes
  • Lower gene density
  • In worms and insects, 25% of the genome codes
    for proteins
  • In humans it is closer to 3%
  • Lots of repetitive DNA
  • Like searching for a needle in a haystack?
  • A 2 gram needle in a 6kg haystack would be 1000
    times easier to find if genes were as different
    from the rest of the genome as a needle is to
    straw

21
Eukaryotic Genes
  • Additional reading (available on website)
  • Zhang (2002) Computational prediction of
    eukaryotic protein-coding genes. Nature Reviews
    Genetics 3:698-709.
  • This is a good overview of the techniques and
    problems involved with finding eukaryotic coding
    genes

22
Exons
  • There are 5 types of exons
  • Noncoding exons representing 5' and 3'
    untranslated regions
  • Initial exons containing the start codon
  • Termination exons containing the stop codon
  • Internal exons containing neither start nor stop
    codons
  • Single exon genes, containing no introns and both
    start and stop codons
  • In general, codons are more difficult to identify
    in eukaryotes because the reading frame (and even
    individual codons) is broken up by the introns

23
Exons
  • Terminal (3') exons are easier to identify than
    initial (5') exons
  • Many eukaryotic genes have AAUAAA or AAUUAAA
    tails some distance after the stop codon
  • Promoters are harder to identify in eukaryotes
    than prokaryotes
  • Many eukaryotic genes have a TATA box
  • Only 70% of human genes have this
  • TATAWAW at position -25 (W indicates an A or a T)
  • Initiator sequence (where transcription actually
    begins, not the start codon)
  • YYCARR with the A being the actual start of
    transcription
  • Transcription factors (around -80)
  • Most eukaryotes have CAAT (consensus is GCCAATCT)

24
Internal Exons and Introns
  • Exons are often followed by GT and preceded by AG
    (outside the actual coding sequence)
  • At least 8 different kinds of introns in
    Eukaryotes (i.e., at least 8 different ways that
    introns get spliced out)
  • The most common intron is known as the GU-AG
    intron
  • The first two nucleotides of the intron are GU
  • The last two nucleotides of the intron are AG
  • In yeast genes
  • AG G100 T100 A100 A100 G100 T100 12Py N C100 A100 G100 GT
  • Introns must be at least 60 bases long (shorter
    introns cannot be recognized)
  • Exons in vertebrates average 450 bp, although some
    are < 100 bp and others are > 2000 bp
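The GU-AG rule and the minimum intron length can be combined into a very crude candidate search: list every GT...AG span (GT is the DNA form of GU) that is at least 60 bases long. The sketch below is only an illustration of that idea. The `max_len` cap and the function name are assumptions, and most of the spans it reports in real sequence would be false positives.

```python
def candidate_introns(dna, min_len=60, max_len=2000):
    """Return (start, end) for every span that begins with GT, ends with AG,
    and has min_len <= length <= max_len. `end` is exclusive.

    min_len follows the slide's 60-base minimum; max_len is an
    illustrative assumption to keep the search bounded.
    """
    dna = dna.upper()
    spans = []
    for start in range(len(dna) - 1):
        if dna[start:start + 2] != "GT":
            continue
        last_end = min(len(dna), start + max_len)
        for end in range(start + min_len, last_end + 1):
            if dna[end - 2:end] == "AG":
                spans.append((start, end))
    return spans


def splice_out(dna, start, end):
    """Remove one candidate intron and join the flanking exons."""
    return dna[:start] + dna[end:]
```

In practice each candidate would be scored further, for example by checking whether removing it leaves an open reading frame.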

25
Introns
  • The number of introns in a gene varies
    tremendously
  • The 6,000 genes of yeast have a total of 239
    introns
  • There are individual human genes with more than
    100 introns
  • 95% of all human genes have at least 1 intron

26
GC content
  • GC content does not vary as much from species to
    species in Eukaryotes as in Prokaryotes
  • Within species, however, GC content can be
    structured into isochores
  • Genes are associated with certain isochores:
    isochores with high GC content have a much higher
    proportion of genes than isochores with low GC
    content
  • Highest GC isochore in humans: 3% of genome, but
    contains 80% of housekeeping genes
  • Lowest GC isochore in humans: 66% of genome, but
    contains 85% of tissue-specific genes
  • GC content by isochore: L1 39%, L2 42%, H1 46%,
    H2 49%, H3 54%
  • CpG islands
  • Every known housekeeping gene is found in a CpG
    island
  • A housekeeping gene is a gene expressed in all
    tissues at all times

27
Algorithms/Software
  • Common ones are Grail EXP and GenScan
  • No specifics will be discussed, but they
    generally come in three types
  • Strict Rule-based
  • Neural networks
  • Hidden Markov Models
  • Rules/training are species dependent; e.g., a rule
    set derived for Drosophila may fail in humans
  • However, comparative searches as already
    discussed (BLAST for Drosophila genes in humans)
    can help speed up the search for human genes to
    use in training algorithms

28
Algorithms/Software
  • How good are the current approaches?
  • So-so
  • High false positive rate
  • Identify genes that are not actually genes
  • Fail to accurately predict complete genes
  • Find some of the exons, but mess up the introns
  • In 2000, the best predictor found true positives
    at about 40% accuracy and rejected true negatives
    at about 30% accuracy
  • Today, these numbers might be closer to 50%
  • One approach that helps is to use multiple
    programs/algorithms and look for consensus among
    their results
  • If all of the programs identify something as a
    gene, there is a better chance that it's correct
    than if only one found it
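The consensus idea can be sketched as a simple vote. The example below assumes each program reports its predictions as (start, end) coordinates and that two programs agree only when the coordinates match exactly; both are simplifying assumptions, and the program names are placeholders. Real pipelines usually allow partial overlap.

```python
from collections import Counter


def consensus_calls(predictions, min_votes):
    """predictions: dict mapping program name -> list of (start, end) calls.

    Returns the set of calls reported by at least min_votes programs.
    Assumption: calls agree only if their coordinates are identical.
    """
    votes = Counter()
    for calls in predictions.values():
        votes.update(set(calls))  # each program votes once per call
    return {call for call, n in votes.items() if n >= min_votes}


# Placeholder program names and coordinates, for illustration only.
predictions = {
    "program_a": [(100, 400), (900, 1200)],
    "program_b": [(100, 400)],
    "program_c": [(100, 400), (2000, 2300)],
}
unanimous = consensus_calls(predictions, min_votes=3)
```

Requiring more votes lowers the false positive rate at the cost of missing genes that only one program can detect.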

29
Algorithms/Software
  • In the long run, even if a gene is identified
    with high confidence
  • It needs to be confirmed experimentally
  • Its function needs to be determined
  • Function is most likely to be determined through
    experiments, but there are computational ways to
    make guesses
  • For example, runs of 20-25 hydrophobic amino
    acids often indicate transmembrane domains
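A crude version of this guess is to scan a protein sequence for long uninterrupted runs of hydrophobic residues. In the sketch below, the set of residues counted as hydrophobic (A, I, L, M, F, V, W) and the function name are assumptions for illustration. Real transmembrane predictors use hydropathy scales averaged over a window rather than requiring a perfect run.

```python
import re

# Assumption: one common, simplified choice of hydrophobic residues.
HYDROPHOBIC = "AILMFVW"


def hydrophobic_runs(protein, min_len=20):
    """Return (start, end) of each run of at least min_len hydrophobic
    residues in a one-letter-code protein sequence (end exclusive)."""
    pattern = re.compile(f"[{HYDROPHOBIC}]{{{min_len},}}")
    return [(m.start(), m.end()) for m in pattern.finditer(protein.upper())]
```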

30
Non-coding elements
  • What about things other than coding genes?
  • 5% of the mammalian genome is functional
  • Only 1-2% is actually coding sequence
  • What about the other 3-4%?
  • Noncoding RNAs
  • Transcription factors and other regulatory sites
  • How do we find these?
  • Similar methods, for the most part

31
Non-coding RNAs
  • Without the codon structure, they are trickier to
    find than coding genes
  • They have promoters and transcription factors,
    but no start/stop codons or introns.
  • They can also be very short

32
MicroRNAs (miRNAs)
  • A large family of noncoding RNAs which help
    regulate gene activity
  • They are only 21-22 nucleotides long
  • How to identify them?
  • The precursor is 70-100 nucleotides long with a
    lot of stem-loop structure
  • Highly conserved among closely related species
  • Diverge in a characteristic way: changes tend to
    take place in a specific part of the stem
  • Search for 100-nucleotide segments in a pair of
    species that differ by fewer than 15% of sites
    with fewer than 13 gaps
  • Analyze each of these conserved regions for
    secondary RNA structure
  • If the predicted structure looks correct, it is
    probably a miRNA
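The first step, finding conserved 100-nucleotide segments, can be sketched as a sliding window over a pairwise alignment. The code below assumes the two sequences are already aligned to equal length with "-" marking gaps, and it reads the gap threshold as a count of gapped columns per window; both are assumptions. The secondary-structure check would be done afterwards with an RNA folding program and is not shown.

```python
def conserved_windows(aln_a, aln_b, window=100, max_diff=0.15, max_gaps=13):
    """Return start columns of alignment windows that look conserved.

    aln_a, aln_b: aligned sequences of equal length, '-' for gaps.
    A window passes if fewer than max_diff * window ungapped columns
    differ and fewer than max_gaps columns contain a gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        cols = list(zip(aln_a[start:start + window],
                        aln_b[start:start + window]))
        gaps = sum(x == "-" or y == "-" for x, y in cols)
        diffs = sum(x != y for x, y in cols if x != "-" and y != "-")
        if diffs < max_diff * window and gaps < max_gaps:
            hits.append(start)
    return hits
```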

33
Other functional elements
  • Primary approach is to look at multiple, closely
    related species and look for conserved upstream
    sequence
  • Searching for cis-acting elements
  • trans-acting elements are essentially impossible
    to identify in this context
  • Sequences which show few changes over time are
    presumably under selection and therefore must
    serve some function
  • Perform multiple local alignments
  • This approach is known as Functional or
    Phylogenetic Footprinting

34
Regulatory sequences
  • Usually vaguely close to a gene, within a few kb
  • However, they can be extremely short (e.g., just
    6 bases)
  • They can be in any orientation
  • Upstream/downstream
  • Either direction on either strand
  • Since finding any one potential regulatory
    sequence is easy to do by chance, it is better to
    look for multiple small sequences that are found
    near each other
  • Look for word 1, a spacer of some length, word 2
  • Use statistical analysis to determine if some
    combinations of words are found more often than
    expected by chance
  • These are likely to have functional significance
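The word, spacer, word search and the "expected by chance" comparison can be sketched as follows. The expected count here assumes random sequence with equal base frequencies on a single strand and ignores edge effects, which is a deliberate simplification; the function names are illustrative.

```python
def find_word_pairs(dna, word1, word2, min_gap, max_gap):
    """Return (position_of_word1, spacer_length) for every occurrence of
    word1 followed by word2 with a spacer of min_gap..max_gap bases."""
    dna = dna.upper()
    hits = []
    start = dna.find(word1)
    while start != -1:
        after = start + len(word1)
        for gap in range(min_gap, max_gap + 1):
            if dna.startswith(word2, after + gap):
                hits.append((start, gap))
        start = dna.find(word1, start + 1)
    return hits


def expected_pairs(seq_len, word1, word2, min_gap, max_gap):
    """Approximate number of hits expected by chance on one strand.

    Assumptions: independent bases, each with frequency 0.25, and
    edge effects near the sequence ends are ignored.
    """
    n_spacers = max_gap - min_gap + 1
    return seq_len * n_spacers * 0.25 ** (len(word1) + len(word2))
```

If the observed number of hits near genes is much larger than `expected_pairs` suggests, the word pair is a candidate regulatory signal worth testing statistically.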