CS 5263 Bioinformatics - PowerPoint PPT Presentation

Slides: 97
Provided by: jru49
Learn more at: http://www.cs.utsa.edu
1
CS 5263 Bioinformatics
  • Lectures 1 & 2: Introduction to Bioinformatics
    and Molecular Biology

2
Outline
  • Administrivia
  • What is bioinformatics
  • Why bioinformatics
  • Course overview
  • Short introduction to molecular biology

3
Survey form
  • Your name
  • Email
  • Academic preparation
  • Interests
  • (This helps me better design lectures and assignments)

4
Course Info
  • Instructor: Jianhua Ruan
  • Office: S.B. 4.01.48
  • Phone: 458-6819
  • Email: jruan_at_cs.utsa.edu
  • Office hours: MW 2-3pm
  • Web: http://www.cs.utsa.edu/jruan/teaching/cs5263_fall_2008/

5
Course description
  • A survey of algorithms and methods in
    bioinformatics, approached from a computational
    viewpoint.
  • Prerequisites:
  • Programming experience
  • Some knowledge of algorithms and data structures
  • Basic understanding of statistics and probability
  • Appetite to learn some biology

6
Textbooks
  • An Introduction to Bioinformatics Algorithms,
    by Jones and Pevzner
  • Biological Sequence Analysis: Probabilistic
    Models of Proteins and Nucleic Acids,
    by Durbin, Eddy, Krogh and Mitchison
  • Additional resources
  • Papers
  • Handouts
  • See course website

7
Grading
  • Attendance: 10%
  • At most 2 classes missed without affecting grade
  • Homework: 50%
  • About 5 assignments
  • Combination of theoretical and programming
    exercises
  • No exams
  • No late submissions accepted
  • Read the collaboration policy!
  • Final project and presentation: 40%

8
Why bioinformatics
  • Advances in experimental technology have
    generated a huge amount of data
  • The human genome is "finished"
  • Even if it were, that's only the beginning
  • The bottleneck is how to integrate and analyze
    the data
  • Noisy
  • Diverse

9
Growth of GenBank vs. Moore's law
10
Genome annotations
Meyer, Trends and Tools in Bioinfo and Compt Bio,
2006
11
What is bioinformatics
  • National Institutes of Health (NIH)
  • Research, development, or application of
    computational tools and approaches for expanding
    the use of biological, medical, behavioral or
    health data, including those to acquire, store,
    organize, archive, analyze, or visualize such
    data.

12
What is bioinformatics
  • National Center for Biotechnology Information
    (NCBI)
  • the field of science in which biology, computer
    science, and information technology merge to form
    a single discipline. The ultimate goal of the
    field is to enable the discovery of new
    biological insights as well as to create a global
    perspective from which unifying principles in
    biology can be discerned.

13
What is bioinformatics
  • Wikipedia
  • Bioinformatics refers to the creation and
    advancement of algorithms, computational and
    statistical techniques, and theory to solve
    formal and practical problems posed by or
    inspired from the management and analysis of
    biological data.

15
Course objectives
  • Learn the basics of sequence analysis and other
    computational biology algorithms
  • Become familiar with research topics in
    bioinformatics
  • Be able to
  • Read / critique bioinformatics research articles
  • Identify subareas that best suit your background
  • Communicate and exchange ideas with
    (computational) biologists

16
What will you learn?
  • Basic concepts in molecular biology and genetics
  • Algorithms to address selected problems in
    bioinformatics
  • Dynamic programming, string algorithms, graph
    algorithms
  • Statistical learning algorithms: HMM, EM, Gibbs
    sampling
  • Data mining: clustering / classification
  • Applications to real data

17
What will you not learn?
  • Designing / performing biological experiments
    (duh!)
  • Programming (in Perl, etc.)
  • Building bioinformatics software tools (GUI,
    database, Web, ...)
  • Using existing tools / databases (well, not
    exactly true)

18
Covered topics
  • Biology (1 week)
  • Sequence analysis (8 weeks)
  • Sequence alignment
  • Pairwise, multiple, global, local, optimal,
    heuristic
  • String matching
  • Motif finding
  • Gene prediction
  • RNA structure prediction
  • Phylogenetic tree
  • Functional Genomics (5 weeks)
  • Microarray data analysis
  • Biological networks

19
Computer Scientists vs Biologists (courtesy
Serafim Batzoglou, Stanford)
20
Biologists vs computer scientists
  • (almost) Everything is true or false in computer
    science
  • (almost) Nothing is ever true or false in Biology

21
Biologists vs computer scientists
  • Biologists seek to understand the complicated,
    messy natural world
  • Computer scientists strive to build their own
    clean and organized virtual world

22
Biologists vs computer scientists
  • Computer scientists are obsessed with being the
    first to invent or prove something
  • Biologists are obsessed with being the first to
    discover something

23
  • Some examples of the central role of CS in
    bioinformatics

24
1. Genome sequencing
3 x 10^9 nucleotides
25
1. Genome sequencing
3 x 10^9 nucleotides
A big puzzle: 60 million pieces
Computational Fragment Assembly
Introduced 1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome
26
2. Gene Finding
Where are the genes?
In humans: 22,000 genes, 1.5% of human DNA
27
2. Gene Finding
Hidden Markov Models (Well studied for many years
in speech recognition)
28
3. Protein Folding
  • The amino-acid sequence of a protein determines
    the 3D fold
  • The 3D fold of a protein determines its function
  • Can we predict 3D fold of a protein given its
    amino-acid sequence?
  • Holy grail of computational biology: a 40-year-old problem
  • Molecular dynamics, computational geometry,
    machine learning

29
4. Sequence Comparison/Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence alignment: introduced 1970
BLAST: 1990, most cited paper in history
Still a very active area of research
BLAST
Efficient string matching algorithms
Fast database index techniques
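
To make the idea of an optimal global alignment concrete, here is a minimal sketch of the classic dynamic-programming approach (Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not something the slides specify, and BLAST itself uses faster heuristics rather than this exhaustive table.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, row_a, row_b).

    The scoring scheme is an assumed toy scheme, chosen for illustration.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    s, top, bottom = needleman_wunsch(
        "AGGCTATCACCTGACCTCCAGGCCGATGCCC",
        "TAGCTATCACGACCGCGGTCGATTTGCCCGAC",
    )
    print(s)
    print(top)
    print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths. That cost is why database search relies on heuristics and indexing instead.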
30
Lipman & Pearson, 1985
..., comparison of a 200-amino-acid sequence to the
500,000 residues in the National Biomedical
Research Foundation library would take less than
2 minutes on a minicomputer, and less than 10
minutes on a microcomputer (IBM PC).
Database size today: 10^12 (a 2-million-fold
increase). BLAST search: 1.5 minutes
31
5. Microarray analysis: Clinical prediction of
Leukemia type
  • 2 types
  • Acute lymphoid (ALL)
  • Acute myeloid (AML)
  • Different treatments & outcomes
  • Predict type before treatment?

Bone marrow samples ALL vs AML
Measure amount of each gene
32
Some goals of biology for the next 50 years
  • List all molecular parts that build an organism
  • Genes, proteins, other functional parts
  • Understand the function of each part
  • Understand how parts interact physically and
    functionally
  • Study how function has evolved across all species
  • Find genetic defects that cause diseases
  • Design drugs rationally
  • Sequence the genome of every human, use it for
    personalized medicine
  • Bioinformatics is an essential component for all
    the goals above

33
  • A short introduction to molecular biology

34
Life
  • Two categories
  • Prokaryotes (e.g. bacteria)
  • Unicellular
  • No nucleus
  • Eukaryotes (e.g. fungi, plants, animals)
  • Unicellular or multicellular
  • Has nucleus

35
Prokaryote vs Eukaryote
  • Eukaryotes have many membrane-bound compartments
    inside the cell
  • Different biological processes occur at different
    cellular locations

36
Organism, Organ, Cell
Organism
37
Chemical contents of cell
  • Water
  • Macromolecules (polymers) - strings made by
    linking monomers from a specified set (alphabet)
  • Protein
  • DNA
  • RNA
  • Small molecules
  • Sugar
  • Ions (Na+, K+, Ca2+, Cl-, ...)
  • Hormone

38
DNA
  • DNA forms the genetic material of all living
    organisms
  • Can be replicated and passed to descendants
  • Contains information to produce proteins
  • To computer scientists, DNA is a string made from
    alphabet A, C, G, T
  • e.g. ACAGAACGTAGTGCCGTGAGCG
  • Each letter is a nucleotide
  • Length varies from hundreds to billions

39
RNA
  • Historically thought to be information carrier
    only
  • DNA -> RNA -> Protein
  • New roles have been found for them
  • To computer scientists, RNA is a string made from
    alphabet A, C, G, U
  • e.g. ACAGAACGUAGUGCCGUGAGCG
  • Each letter is a nucleotide
  • Length varies from tens to thousands

40
Protein
  • Protein: the actual worker for almost all
    processes in the cell
  • Enzymes: speed up reactions
  • Signaling: information transduction
  • Structural support
  • Production of other macromolecules
  • Transport
  • To computer scientists, protein is a string made
    from 20 kinds of characters
  • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
  • Each letter is called an amino acid
  • Length varies from tens to thousands

41
DNA/RNA zoom-in
  • Commonly referred to as Nucleic Acid
  • DNA Deoxyribonucleic acid
  • RNA Ribonucleic acid
  • Found mainly in the nucleus of a cell (hence
    nucleic)
  • Contain phosphoric acid as a component (hence
    acid)
  • They are made up of a string of nucleotides

42
Nucleotides
  • A nucleotide has 3 components
  • Sugar ring (ribose in RNA, deoxyribose in DNA)
  • Phosphoric acid
  • Nitrogen base
  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T) or Uracil (U)

43
Monomers of RNA: ribo-nucleotide
  • A ribonucleotide has 3 components
  • Sugar - Ribose
  • Phosphate group
  • Nitrogen base
  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Uracil (U)

44
Monomers of DNA: deoxy-ribo-nucleotide
  • A deoxyribonucleotide has 3 components
  • Sugar - Deoxy-ribose
  • Phosphate group
  • Nitrogen base
  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T)

45
Polymerization: Nucleotides -> nucleic acids
46
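DNA
[Figure: a DNA strand with the free phosphate at the 5' (5 prime) end and the 3' (3 prime) end marked. Each nucleotide is drawn as a base, a phosphate, and a sugar whose carbons are numbered 1' to 5'.]
5'-AGCGACTG-3'
AGCGACTG
Often recorded from 5' to 3', which is the
direction of many biological processes, e.g. DNA
replication, transcription, etc.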
47
RNA
[Figure: an RNA strand with the free phosphate at the 5' (5 prime) end and the 3' (3 prime) end marked.]
5'-AGUGACUG-3'
AGUGACUG
Often recorded from 5' to 3', which is the
direction of many biological processes, e.g.
translation.
48
DNA usually exists in pairs.
Base-pairs: A-T, G-C
Forward (+) strand:  5'-AGCGACTG-3'
Backward (-) strand: 3'-TCGCTGAC-5'
AGCGACTG
TCGCTGAC
One strand is said to be reverse-complementary
to the other
49
DNA double helix
G-C pair is stronger than A-T pair
50
Reverse-complementary sequences
  • 5'-ACGTTACAGTA-3'
  • The reverse complement is
  • 3'-TGCAATGTCAT-5'
  • ->
  • 5'-TACTGTAACGT-3'
  • Or simply written as
  • TACTGTAACGT
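
Treating DNA as a string, the reverse complement on this slide can be computed in two steps: complement each base, then reverse the result. The following is a minimal sketch; the function name is an illustrative choice, and it assumes uppercase input containing only A, C, G, T.

```python
# Complement table for uppercase DNA letters (assumption: no ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ACGTTACAGTA"))  # the slide's example sequence
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.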

51
Orientation of the double helix
  • Double helix is anti-parallel
  • 5' end of each strand pairs with 3' end of the
    other
  • 5' to 3' motion in one strand is 3' to 5' in the
    other
  • Double helix has no orientation
  • Biology has no forward and reverse strand
  • Relative to any single strand, there is a
    reverse complement or reverse strand
  • Information can be encoded by either strand or
    both strands
  • 5'-TTTTACAGGACCATG-3'
  • 3'-AAAATGTCCTGGTAC-5'

52
RNA
  • RNAs are normally single-stranded
  • Form complex structures by self-base-pairing
  • A-U, C-G
  • Can also form RNA-DNA and RNA-RNA double strands.
  • A-T/U, C-G

53
Protein zoom-in
  • Protein is the actual worker for almost all
    processes in the cell
  • A string built from 20 letters
  • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
  • Each letter is called an amino acid
       R  (side chain)
       |
  H2N--C--COOH
       |
       H

Generic chemical form of amino acid
54
Amino acid
  • 20 amino acids, only differ at side chains
  • Each can be expressed by three letters
  • Or a single letter A-Y, except B, J, O, U, X, Z
  • Alanine Ala A
  • Histidine His H

55
Amino acids -> peptide
       R              R
       |              |
  H2N--C--COOH   H2N--C--COOH
       |              |
       H              H

  ->

       R          R
       |          |
  H2N--C--CO--NH--C--COOH
       |          |
       H          H

Peptide bond: CO--NH
56
Protein
  • Has orientations
  • Usually recorded from N-terminal to C-terminal
  • Peptide vs. protein: basically the same thing
  • Conventions
  • Peptide is shorter (< 50 aa), while protein is
    longer
  • Peptide refers to the sequence, while protein has
    2D/3D structure

57
Protein structure
  • Linear sequence of amino acids folds to form a
    complex 3-D structure.
  • The structure of a protein is intimately
    connected to its function.

58
Genome and chromosome
  • Genome: the complete DNA sequence in the cell of
    an organism
  • May contain one (in most prokaryotes) or more (in
    eukaryotes) chromosomes
  • Chromosome: a single large DNA molecule in the
    cell
  • May be circular or linear
  • Contains genes as well as "junk" DNA
  • Highly packed!

59
Formation of chromosome
60
Formation of chromosome
50,000 times shorter than extended DNA
The total length of DNA present in one adult
human is the equivalent of nearly 70 round trips
from the earth to the sun
61
Gene
  • Gene: unit of heredity in living organisms
  • A segment of DNA with information to make a
    protein

62
Some statistics
                  Chromosomes  Bases        Genes
Human             46           3 billion    20k-25k
Dog               78           2.4 billion  20k
Corn              20           2.5 billion  50-60k
Yeast             16           20 million   7k
E. coli           1            4 million    4k
Marbled lungfish  ?            130 billion  ?
63
Human genome
  • 46 chromosomes: 22 pairs + X, Y
  • 1 from mother, 1 from father
  • Female: X X
  • Male: X Y

64
Human genome
  • Every cell contains the same genomic information
  • Except sperm and eggs, which only contain half
    of the genome
  • Otherwise your children would have 46 + 46
    chromosomes

65
Cell division: mitosis
  • A cell duplicates its genome and divides into two
    identical cells
  • These cells build up different parts of your body

66
Cell division: meiosis
  • A reproductive cell divides into four cells, each
    containing only half of the genome
  • Diploid -> haploid
  • Two haploid cells (sperm + egg) form a zygote
  • Which will then develop into a multi-cellular
    organism by mitosis

67
Central dogma of molecular biology
DNA replication is critical in both mitosis and
meiosis
68
DNA Replication
  • The process of copying a double-stranded DNA
    molecule
  • Semi-conservative
  • 5'-ACATGATAA-3'
  • 3'-TGTACTATT-5'
  • ->
  • 5'-ACATGATAA-3' 5'-ACATGATAA-3'
  • 3'-TGTACTATT-5' 3'-TGTACTATT-5'

69
  • Mutation: changes in DNA base-pairs
  • Proofreading and error-correcting mechanisms
    exist to ensure extremely high fidelity

70
Central dogma of molecular biology
71
Transcription
  • The process by which a DNA sequence is copied to
    produce a complementary RNA
  • Called messenger RNA (mRNA) if the RNA carries
    instructions on how to make a protein
  • Called non-coding RNA if the RNA does not carry
    instructions on how to make a protein
  • Only consider mRNA for now
  • Similar to replication, but
  • Only one strand is copied

72
Transcription
DNA-RNA pairs: A-U, C-G, T-A, G-C
Coding strand (where genetic information is stored):
5'-ACGTAGACGTATAGAGCCTAG-3'
Template strand (for making mRNA):
3'-TGCATCTGCATATCTCGGATC-5'
mRNA:
5'-ACGUAGACGUAUAGAGCCUAG-3'
Coding strand and mRNA have the same sequence,
except that Ts in DNA are replaced by Us in
mRNA.
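
As a small sketch of the relationship just described, the mRNA can be derived either from the coding strand (swap T for U) or from the template strand (pair each base using the DNA-RNA rules). The function names are illustrative assumptions. The template is assumed to be written 3' to 5', as on the slide, so that reading the result left to right gives the mRNA 5' to 3'.

```python
# DNA template base -> RNA base it pairs with during transcription.
DNA_TO_RNA_PAIR = {"A": "U", "C": "G", "G": "C", "T": "A"}


def mrna_from_coding(coding_5to3):
    """mRNA has the coding strand's sequence with T replaced by U."""
    return coding_5to3.replace("T", "U")


def mrna_from_template(template_3to5):
    """Pair each template base (given 3'->5') to build the mRNA 5'->3'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3to5)


if __name__ == "__main__":
    coding = "ACGTAGACGTATAGAGCCTAG"    # 5'->3', from the slide
    template = "TGCATCTGCATATCTCGGATC"  # 3'->5', from the slide
    print(mrna_from_coding(coding))
    print(mrna_from_template(template))
```

Both routes should give the same string, which mirrors the slide's point that the coding strand and the mRNA carry the same sequence.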
73
Translation
  • The process of making proteins from mRNA
  • A gene uniquely encodes a protein
  • There are four bases in DNA (A, C, G, T), and
    four in RNA (A, C, G, U), but 20 amino acids in
    protein
  • How many nucleotides are required to encode an
    amino acid in order to ensure correct
    translation?
  • 4^1 = 4
  • 4^2 = 16
  • 4^3 = 64
  • The actual genetic code used by the cell is a
    triplet.
  • Each triplet is called a codon
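
The counting argument above can be checked with a few lines of code: find the smallest codon length k such that 4^k is at least 20. This is only a sketch of the arithmetic; the function name and defaults are illustrative.

```python
def min_codon_length(alphabet_size=4, amino_acids=20):
    """Smallest k with alphabet_size**k >= amino_acids."""
    k = 1
    while alphabet_size ** k < amino_acids:
        k += 1
    return k


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, 4 ** k)         # number of distinct words of length k
    print(min_codon_length())    # shortest length that covers 20 amino acids
```

Since 4^3 = 64 exceeds 20, several codons can map to the same amino acid, and some codons are left over to serve as stop signals.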

74
The Genetic Code
Third letter
75
Translation
  • The sequence of codons is translated to a
    sequence of amino acids
  • Gene: -GCT TGT TTA CGA ATT-
  • mRNA: -GCU UGU UUA CGA AUU-
  • Peptide: -Ala-Cys-Leu-Arg-Ile-
  • Start codon: AUG
  • Also codes for Met
  • Stop codons: UGA, UAA, UAG
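
The codon-by-codon reading can be sketched as a table lookup. The code below builds the standard genetic code from a compact 64-letter string (bases ordered U, C, A, G at each position, "*" marking stop codons) and outputs one-letter amino acid codes. The function name and the choice to stop at the first stop codon are illustrative assumptions, and the sketch does not search for a start codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated as UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base U
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string from its first base, one codon at a time.

    Stops at the first stop codon; a trailing incomplete codon is ignored.
    """
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


if __name__ == "__main__":
    # The slide's example: Ala-Cys-Leu-Arg-Ile in one-letter codes.
    print(translate("GCUUGUUUACGAAUU"))
```

In one-letter codes, A, C, L, R, I stand for Ala, Cys, Leu, Arg, Ile, matching the peptide on the slide.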

76
Translation
  • Transfer RNA (tRNA): a different type of RNA.
  • Floats freely in the cell.
  • Every amino acid has its own type of tRNA that
    binds to it alone.
  • Anti-codon/codon binding is crucial.

[Figure labels: tRNA-Pro, anti-codon, nascent peptide, tRNA-Leu, mRNA]
77
Transcriptional regulation
[Figure labels: transcription factor, RNA polymerase, transcription starting site, gene, promoter]
  • Will talk more in later lectures
  • RNA polymerase binds to a certain location on
    the promoter to initiate transcription
  • Transcription factors bind to specific sequences
    on the promoter to regulate transcription
  • Recruit RNA polymerase: induce
  • Block RNA polymerase: repress
  • Multiple transcription factors may coordinate

78
Splicing
[Figure: a promoter and transcription starting site upstream of a gene; transcription yields pre-mRNA made of a 5' UTR, exons separated by introns, and a 3' UTR; splicing yields mature mRNA (mRNA), whose open reading frame (ORF) runs from the start codon to the stop codon.]
  • Pre-mRNA needs to be edited to form mature mRNA
  • Will talk more in later lectures.

79
Summary
  • DNA: a string made from A, C, G, T
  • Forms the basis of genes
  • Has 5' and 3' ends
  • Normally forms a double strand with its reverse
    complement
  • RNA: a string made from A, C, G, U
  • mRNA: messenger RNA
  • tRNA: transfer RNA
  • Other types of RNA: rRNA, miRNA, etc.
  • Has 5' and 3' ends
  • Normally single-stranded, but can form secondary
    structure
  • Protein: made from 20 kinds of amino acids
  • Actual worker in the cell
  • Has N-terminal and C-terminal
  • Sequence uniquely determined by its gene via the
    use of codons
  • Sequence determines structure, structure
    determines function
  • Central dogma: DNA is transcribed to RNA, RNA is
    translated to protein
  • Both steps are regulated

80
  • Experimental techniques to manipulate DNA

81
DNA synthesis
  • Creating DNA synthetically in a laboratory
  • Chemical synthesis
  • Chemical reactions
  • Arbitrary sequences
  • Maximum length: 160-200
  • Cloning: make copies based on a DNA template
  • Biological reactions
  • Requires template
  • Many copies of a long DNA in a short time

82
in vivo DNA Cloning
  • Connect a piece of DNA to bacterial DNA, which
    can then be replicated together with the host DNA

bacterial DNA
83
in vitro DNA Cloning
  • Polymerase chain reaction (PCR)

[Figure: PCR steps, with the 5' end of each strand marked: denature the double strand; primers (< 30 bases); DNA polymerase and dNTP.]
84
Some terms
  • Denature: a DNA double strand is separated into
    two strands
  • By raising temperature
  • Renature: the process by which two denatured DNA
    strands re-form a double strand
  • By cooling down slowly
  • Hybridization: two heterogeneous DNAs form a
    double-stranded DNA
  • may have mismatches
  • The rationale behind many molecular biological
    techniques including DNA microarray

85
DNA sequencing technology
  • Read out the letters from a DNA sequence

1974, Frederick Sanger
GTGAGGCGCTGC
86
DNA sequencing: Basic idea
  • PCR
  • primer extension
  • 5'-TTACAGGTCCATACTA ->
  • 3'-AATGTCCAGGTATGATACATAGG-5'
  • We need to supply A, C, G, T for the synthesis to
    continue
  • Besides A, C, G, T, we add some modified A*, C*,
    G*, and T*
  • Very similar to A, C, G, T in all aspects, except
    that
  • The extension will stop if one is used
87
DNA sequencing, cont
88
DNA sequencing, cont
90
Advances in DNA sequencing
  • 1969: three years to sequence 115 nt of DNA
  • 1979: three years to sequence 1650 nt
  • 1989: one week to sequence 1650 nt
  • 1995: Haemophilus genome sequenced at TIGR -
    1,830,138 nt
  • 2000: Human Genome - working draft sequence, 3
    billion bases
  • 2003: (near) completion of human genome

91
The bioinformatics landmark
  • Completion of human genome sequencing is a
    success made possible by
  • Advancement in sequencing technology
  • Speed of computation
  • Algorithm development in bioinformatics
  • HGP (Human Genome Project) strategy
  • Hierarchical sequencing
  • Estimated 15 years (1990-2005), completed in 13
    years
  • $3 billion
  • Celera strategy
  • Whole-genome shotgun sequencing
  • Three years (1998-2001)
  • $300 million
92
Now
  • Over 300 genomes have been sequenced
  • 10^11 - 10^12 nt

93
2007
  • Genomes of three individual humans were sequenced
  • James Watson
  • Craig Venter
  • TBN Chinese
  • Cost for sequencing Watson's genome
  • $3 million, 2 months
  • Compared to $3 billion, 13 years for HGP

94
  • Sequencing speed has been tremendously improved
  • High efficiency and relatively low cost make it
    possible to sequence the genome of any individual
    from any species
  • What's next?

95
  • Continue to sequence more species?
  • More individuals?
  • What to do with those sequences?

96
  • Coming next: biological sequence analysis