CS 5263 Bioinformatics

Provided by: jru49

Learn more at: http://www.cs.utsa.edu

1
CS 5263 Bioinformatics

Lectures 1 & 2: Introduction to Bioinformatics
and Molecular Biology

2
Outline

Administrivia
What is bioinformatics
Why bioinformatics
Course overview
Short introduction to molecular biology

3
Survey form

Your name
Email
Academic preparation
Interests
(To help me better design lectures and assignments)

4
Course Info

Instructor: Jianhua Ruan
Office: S.B. 4.01.48
Phone: 458-6819
Email: jruan_at_cs.utsa.edu
Office hours: MW 2-3pm
Web: http://www.cs.utsa.edu/jruan/teaching/cs5263_fall_2008/

5
Course description

A survey of algorithms and methods in
bioinformatics, approached from a computational
viewpoint.
Prerequisites:
Programming experience
Some knowledge of algorithms and data structures
Basic understanding of statistics and probability
Appetite to learn some biology

6
Textbooks

An Introduction to Bioinformatics Algorithms,
by Jones and Pevzner
Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids,
by Durbin, Eddy, Krogh and Mitchison
Additional resources:
Papers
Handouts
See course website

7
Grading

Attendance: 10%
At most 2 classes missed without affecting grade
Homework: 50%
About 5 assignments
Combination of theoretical and programming
exercises
No exams
No late submissions accepted
Read the collaboration policy!
Final project and presentation: 40%

8
Why bioinformatics

Advances in experimental technology have
generated a huge amount of data
The human genome is "finished"
Even if it were, that's only the beginning
The bottleneck is how to integrate and analyze
the data, which are:
Noisy
Diverse

9
Growth of GenBank vs Moore's law
10
Genome annotations
Meyer, Trends and Tools in Bioinfo and Compt Bio,
2006
11
What is bioinformatics

National Institutes of Health (NIH)
Research, development, or application of
computational tools and approaches for expanding
the use of biological, medical, behavioral or
health data, including those to acquire, store,
organize, archive, analyze, or visualize such
data.

12
What is bioinformatics

National Center for Biotechnology Information
(NCBI)
the field of science in which biology, computer
science, and information technology merge to form
a single discipline. The ultimate goal of the
field is to enable the discovery of new
biological insights as well as to create a global
perspective from which unifying principles in
biology can be discerned.

13
What is bioinformatics

Wikipedia
Bioinformatics refers to the creation and
advancement of algorithms, computational and
statistical techniques, and theory to solve
formal and practical problems posed by or
inspired from the management and analysis of
biological data.

15
Course objectives

Learn the basics of sequence analysis and other
computational biology algorithms
Become familiar with the research topics in
bioinformatics
Be able to:
Read / critique bioinformatics research articles
Identify subareas that best suit your background
Communicate and exchange ideas with
(computational) biologists

16
What will you learn?

Basic concepts in molecular biology and genetics
Algorithms to address selected problems in
bioinformatics
Dynamic programming, string algorithms, graph
algorithms
Statistical learning algorithms: HMM, EM, Gibbs
sampling
Data mining: clustering / classification
Applications to real data

17
What will you not learn?

Designing / performing biological experiments
(duh!)
Programming (in Perl, etc.)
Building bioinformatics software tools (GUI,
database, Web, ...)
Using existing tools / databases (well, not
exactly true)

18
Covered topics

Biology (1 week)
Sequence analysis (8 weeks)
Sequence alignment
Pairwise, multiple, global, local, optimal,
heuristic
String matching
Motif finding
Gene prediction
RNA structure prediction
Phylogenetic tree
Functional genomics (5 weeks)
Microarray data analysis
Biological networks

19
Computer Scientists vs Biologists (courtesy
Serafim Batzoglou, Stanford)
20
Biologists vs computer scientists

(almost) Everything is true or false in computer
science
(almost) Nothing is ever true or false in Biology

21
Biologists vs computer scientists

Biologists seek to understand the complicated,
messy natural world
Computer scientists strive to build their own
clean and organized virtual world

22
Biologists vs computer scientists

Computer scientists are obsessed with being the
first to invent or prove something
Biologists are obsessed with being the first to
discover something

Some examples of the central role of CS in
bioinformatics

24
1. Genome sequencing
3 x 10^9 nucleotides
25
1. Genome sequencing
3 x 10^9 nucleotides
A big puzzle: 60 million pieces
Computational fragment assembly:
Introduced 1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome
26
2. Gene Finding
Where are the genes?
In humans: 22,000 genes, 1.5% of human DNA
27
2. Gene Finding
Hidden Markov Models (Well studied for many years
in speech recognition)
28
3. Protein Folding

The amino-acid sequence of a protein determines
the 3D fold
The 3D fold of a protein determines its function
Can we predict the 3D fold of a protein given its
amino-acid sequence?
Holy grail of computational biology: a 40-year-old problem
Molecular dynamics, computational geometry,
machine learning

29
4. Sequence Comparison: Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence alignment: introduced 1970
BLAST: 1990, most cited paper in history
Still a very active area of research
BLAST:
Efficient string matching algorithms
Fast database index techniques
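
As a small illustration of what an alignment records, the sketch below counts matching, mismatching, and gapped columns in the gapped alignment shown on this slide. The function name `alignment_columns` is hypothetical, and this only inspects an existing alignment; it is not an alignment algorithm such as the ones covered later in the course.

```python
def alignment_columns(top: str, bottom: str, gap: str = "-") -> dict:
    """Count match, mismatch, and gap columns of a pairwise alignment.

    Both rows must already be aligned, i.e. have the same length.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            counts["gap"] += 1       # one sequence has no letter in this column
        elif a == b:
            counts["match"] += 1     # identical letters
        else:
            counts["mismatch"] += 1  # different letters (a substitution)
    return counts


# The two aligned rows from the slide.
top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
counts = alignment_columns(top, bottom)
print(counts)
```

Removing the gap characters from each row gives back the two unaligned input sequences, which is a quick sanity check on any alignment.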
30
Lipman & Pearson, 1985:
"..., comparison of a 200-amino-acid sequence to the
500,000 residues in the National Biomedical
Research Foundation library would take less than
2 minutes on a minicomputer, and less than 10
minutes on a microcomputer (IBM PC)."
Database size today: 10^12 (a 2-million-fold
increase). BLAST search: 1.5 minutes
31
5. Microarray analysis: Clinical prediction of
Leukemia type

2 types
Acute lymphoid (ALL)
Acute myeloid (AML)
Different treatments & outcomes
Predict type before treatment?

Bone marrow samples: ALL vs AML
Measure amount of each gene
32
Some goals of biology for the next 50 years

List all molecular parts that build an organism
Genes, proteins, other functional parts
Understand the function of each part
Understand how parts interact physically and
functionally
Study how function has evolved across all species
Find genetic defects that cause diseases
Design drugs rationally
Sequence the genome of every human, use it for
personalized medicine
Bioinformatics is an essential component for all
the goals above

A short introduction to molecular biology

34
Life

Two categories
Prokaryotes (e.g. bacteria)
Unicellular
No nucleus
Eukaryotes (e.g. fungi, plant, animal)
Unicellular or multicellular
Has nucleus

35
Prokaryote vs Eukaryote

Eukaryotes have many membrane-bound compartments
inside the cell
Different biological processes occur at different
cellular locations

36
Organism, Organ, Cell
Organism
37
Chemical contents of cell

Water
Macromolecules (polymers) - strings made by
linking monomers from a specified set (alphabet)
Protein
DNA
RNA
Small molecules
Sugar
Ions (Na+, K+, Ca2+, Cl-, ...)
Hormone

38
DNA

DNA forms the genetic material of all living
organisms
Can be replicated and passed to descendants
Contains information to produce proteins
To computer scientists, DNA is a string made from
alphabet A, C, G, T
e.g. ACAGAACGTAGTGCCGTGAGCG
Each letter is a nucleotide
Length varies from hundreds to billions

39
RNA

Historically thought to be an information carrier
only
DNA -> RNA -> Protein
New roles have been found for RNA
To computer scientists, RNA is a string made from
alphabet A, C, G, U
e.g. ACAGAACGUAGUGCCGUGAGCG
Each letter is a nucleotide
Length varies from tens to thousands

40
Protein

Protein: the actual worker for almost all
processes in the cell
Enzymes: speed up reactions
Signaling: information transduction
Structural support
Production of other macromolecules
Transport
To computer scientists, protein is a string made
from 20 kinds of characters
E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
Each letter is called an amino acid
Length varies from tens to thousands

41
DNA/RNA zoom-in

Commonly referred to as nucleic acids
DNA: Deoxyribonucleic acid
RNA: Ribonucleic acid
Found mainly in the nucleus of a cell (hence
nucleic)
Contain phosphoric acid as a component (hence
acid)
They are made up of a string of nucleotides

42
Nucleotides

A nucleotide has 3 components
Sugar ring (ribose in RNA, deoxyribose in DNA)
Phosphoric acid
Nitrogen base
Adenine (A)
Guanine (G)
Cytosine (C)
Thymine (T) or Uracil (U)

43
Monomers of RNA: ribonucleotides

A ribonucleotide has 3 components
Sugar - Ribose
Phosphate group
Nitrogen base
Adenine (A)
Guanine (G)
Cytosine (C)
Uracil (U)

44
Monomers of DNA: deoxyribonucleotides

A deoxyribonucleotide has 3 components
Sugar - Deoxyribose
Phosphate group
Nitrogen base
Adenine (A)
Guanine (G)
Cytosine (C)
Thymine (T)

45
Polymerization: Nucleotides -> nucleic acids
46
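[Figure: a DNA strand drawn from the free phosphate at the 5' (5 prime) end to the 3' (3 prime) end; each nucleotide shows its base, phosphate, and sugar, with the sugar carbons numbered 1' to 5']
DNA: 5'-AGCGACTG-3', or simply AGCGACTG
Often recorded from 5' to 3', which is the
direction of many biological processes, e.g. DNA
replication, transcription, etc.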
47
[Figure: an RNA strand drawn from the free phosphate at the 5' (5 prime) end to the 3' (3 prime) end]
RNA: 5'-AGUGACUG-3', or simply AGUGACUG
Often recorded from 5' to 3', which is the
direction of many biological processes, e.g.
translation.
48
DNA usually exists in pairs.
Base pairs: A-T, G-C
Forward (+) strand: 5'-AGCGACTG-3'
Backward (-) strand: 3'-TCGCTGAC-5'
One strand is said to be reverse-complementary
to the other
49
DNA double helix
G-C pair is stronger than A-T pair
50
Reverse-complementary sequences

5'-ACGTTACAGTA-3'
The reverse complement is
3'-TGCAATGTCAT-5'
->
5'-TACTGTAACGT-3'
Or simply written as
TACTGTAACGT
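
A minimal sketch of the reverse-complement operation on this slide, assuming sequences are uppercase strings over A, C, G, T written 5' to 3'. The names `COMPLEMENT` and `reverse_complement` are illustrative.

```python
# Map each base to its Watson-Crick partner: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string written 5' to 3'."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT, as on the slide
```

Applying the function twice returns the original sequence, which reflects the fact that each strand is the reverse complement of the other.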

51
Orientation of the double helix

Double helix is anti-parallel
The 5' end of each strand pairs with the 3' end
of the other
5' to 3' motion in one strand is 3' to 5' in the
other
Double helix has no orientation
Biology has no forward and reverse strand
Relative to any single strand, there is a
reverse complement or reverse strand
Information can be encoded by either strand or
both strands
5'-TTTTACAGGACCATG-3'
3'-AAAATGTCCTGGTAC-5'

52
RNA

RNAs are normally single-stranded
Form complex structures by self-base-pairing
A-U, C-G
Can also form RNA-DNA and RNA-RNA double strands
A-T/U, C-G

53
Protein zoom-in

Protein is the actual worker for almost all
processes in the cell
A string built from 20 letters
E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
Each letter is called an amino acid
Generic chemical form of an amino acid (R is the side chain):
     R
     |
H2N--C--COOH
     |
     H
54
Amino acid

20 amino acids, only differ at side chains
Each can be expressed by three letters
Or a single letter A-Y, except B, J, O, U, X, Z
Alanine Ala A
Histidine His H

55
Amino acids -> peptide

     R                R
     |                |
H2N--C--COOH  +  H2N--C--COOH
     |                |
     H                H

     R          R
     |          |
H2N--C--CO--NH--C--COOH
     |          |
     H          H

Peptide bond: the CO--NH link between the two residues
56
Protein

Has an orientation
Usually recorded from N-terminal to C-terminal
Peptide vs protein: basically the same thing
Conventions
Peptide is shorter (< 50 aa), while protein is
longer
Peptide refers to the sequence, while protein has
2D/3D structure

57
Protein structure

Linear sequence of amino acids folds to form a
complex 3-D structure.
The structure of a protein is intimately
connected to its function.

58
Genome and chromosome

Genome: the complete DNA sequence in the cell of
an organism
May contain one (in most prokaryotes) or more (in
eukaryotes) chromosomes
Chromosome: a single large DNA molecule in the
cell
May be circular or linear
Contains genes as well as "junk" DNA
Highly packed!

59
Formation of chromosome
60
Formation of chromosome
50,000 times shorter than extended DNA
The total length of DNA present in one adult
human is the equivalent of nearly 70 round trips
from the earth to the sun
61
Gene

Gene: unit of heredity in living organisms
A segment of DNA with information to make a
protein

62
Some statistics
Organism          Chromosomes  Bases         Genes
Human             46           3 billion     20k-25k
Dog               78           2.4 billion   20k
Corn              20           2.5 billion   50k-60k
Yeast             16           20 million    7k
E. coli           1            4 million     4k
Marbled lungfish  ?            130 billion   ?
63
Human genome

46 chromosomes: 22 pairs + X, Y
1 from mother, 1 from father
Female: X X
Male: X Y

64
Human genome

Every cell contains the same genomic information
Except sperms and eggs, which only contain half
of the genome
Otherwise your children would have 46 + 46
chromosomes

65
Cell division: mitosis

A cell duplicates its genome and divides into two
identical cells
These cells build up different parts of your body

66
Cell division: meiosis

A reproductive cell divides into four cells, each
containing only half of the genome
Diploid -> haploid
Two haploid cells (sperm + egg) form a zygote,
which will then develop into a multi-cellular
organism by mitosis

67
Central dogma of molecular biology
DNA replication is critical in both mitosis and
meiosis
68
DNA Replication

The process of copying a double-stranded DNA
molecule
Semi-conservative
5'-ACATGATAA-3'
3'-TGTACTATT-5'
->
5'-ACATGATAA-3'   5'-ACATGATAA-3'
3'-TGTACTATT-5'   3'-TGTACTATT-5'

Mutation: changes in DNA base pairs
Proofreading and error-correcting mechanisms
exist to ensure extremely high fidelity

70
Central dogma of molecular biology
71
Transcription

The process by which a DNA sequence is copied to
produce a complementary RNA
Called messenger RNA (mRNA) if the RNA carries
instructions on how to make a protein
Called non-coding RNA if the RNA does not carry
instructions on how to make a protein
Only consider mRNA for now
Similar to replication, but
only one strand is copied

72
Transcription
DNA-RNA pairs: A-U, C-G, T-A, G-C
Coding strand (where genetic information is stored): 5'-ACGTAGACGTATAGAGCCTAG-3'
Template strand (for making mRNA): 3'-TGCATCTGCATATCTCGGATC-5'
mRNA: 5'-ACGUAGACGUAUAGAGCCUAG-3'
Coding strand and mRNA have the same sequence,
except that Ts in DNA are replaced by Us in
mRNA.
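
The relationship between the two DNA strands and the mRNA can be sketched as follows, using the sequences from this slide. The function names are hypothetical; the template strand is assumed to be given 3' to 5', as written on the slide, so that pairing it base by base yields the mRNA 5' to 3'.

```python
# DNA base on the template strand -> RNA base that pairs with it.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def mrna_from_coding(coding_5to3: str) -> str:
    """The mRNA equals the coding strand with T replaced by U."""
    return coding_5to3.replace("T", "U")


def mrna_from_template(template_3to5: str) -> str:
    """Pair each template base (read 3' to 5') with its RNA partner."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3to5)


coding = "ACGTAGACGTATAGAGCCTAG"    # 5' -> 3'
template = "TGCATCTGCATATCTCGGATC"  # 3' -> 5'
print(mrna_from_coding(coding))
print(mrna_from_template(template))  # same mRNA either way
```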
73
Translation

The process of making proteins from mRNA
A gene uniquely encodes a protein
There are four bases in DNA (A, C, G, T), and
four in RNA (A, C, G, U), but 20 amino acids in
protein
How many nucleotides are required to encode an
amino acid in order to ensure correct
translation?
4^1 = 4
4^2 = 16
4^3 = 64
The actual genetic code used by the cell is a
triplet.
Each triplet is called a codon
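
The counting argument above can be checked with a few lines of code: find the smallest word length k such that 4^k is at least 20. The function name `min_codon_length` is illustrative.

```python
def min_codon_length(alphabet_size: int = 4, n_targets: int = 20) -> int:
    """Smallest k with alphabet_size ** k >= n_targets."""
    k = 1
    while alphabet_size ** k < n_targets:
        k += 1
    return k


print(min_codon_length())  # 3: two bases give only 16 words, three give 64
```

With 64 triplets for 20 amino acids (plus stop signals), several codons can encode the same amino acid.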

74
The Genetic Code
[Figure: genetic code table (axis label: third letter)]
75
Translation

The sequence of codons is translated to a
sequence of amino acids
Gene:    -GCT TGT TTA CGA ATT-
mRNA:    -GCU UGU UUA CGA AUU-
Peptide: -Ala-Cys-Leu-Arg-Ile-
Start codon: AUG
Also codes for Met
Stop codons: UGA, UAA, UAG
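
A minimal sketch of translation, assuming the standard genetic code and one-letter amino acid symbols (A = Ala, C = Cys, L = Leu, R = Arg, I = Ile, M = Met). The compact table layout and the names `CODON_TABLE` and `translate` are illustrative choices, and the function simply reads from the first base rather than searching for a start codon.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# (first base varies slowest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # UAA, UAG, or UGA
            break
        peptide.append(aa)
    return "".join(peptide)


print(translate("GCUUGUUUACGAAUU"))  # ACLRI, i.e. Ala-Cys-Leu-Arg-Ile
```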

76
Translation

Transfer RNA (tRNA): a different type of RNA.
Floats freely in the cell.
Every amino acid has its own type of tRNA that
binds to it alone.
Anti-codon/codon binding is crucial.

[Figure labels: tRNA-Pro, anti-codon, nascent peptide, tRNA-Leu, mRNA]
77
Transcriptional regulation
[Figure labels: transcription factor, RNA polymerase, transcription starting site, gene, promoter]

Will talk more in later lectures
RNA polymerase binds to a certain location on
the promoter to initiate transcription
Transcription factors bind to specific sequences
on the promoter to regulate transcription
Recruit RNA polymerase: induce
Block RNA polymerase: repress
Multiple transcription factors may coordinate

78
Splicing
[Figure: promoter, transcription starting site, and gene; transcription produces the pre-mRNA]

Pre-mRNA needs to be edited to form mature mRNA
Will talk more in later lectures.

[Figure: pre-mRNA with 5' UTR, exons, introns, and 3' UTR; splicing yields the mature mRNA (mRNA), whose open reading frame (ORF) runs from the start codon to the stop codon]
79
Summary

DNA: a string made from A, C, G, T
Forms the basis of genes
Has 5' and 3' ends
Normally forms a double strand by reverse
complement
RNA: a string made from A, C, G, U
mRNA: messenger RNA
tRNA: transfer RNA
Other types of RNA: rRNA, miRNA, etc.
Has 5' and 3' ends
Normally single-stranded, but can form secondary
structure
Protein: made from 20 kinds of amino acids
Actual worker in the cell
Has N-terminal and C-terminal
Sequence uniquely determined by its gene via the
use of codons
Sequence determines structure, structure
determines function
Central dogma: DNA is transcribed to RNA, RNA is
translated to protein
Both steps are regulated

Experimental techniques to manipulate DNA

81
DNA synthesis

Creating DNA synthetically in a laboratory
Chemical synthesis
Chemical reactions
Arbitrary sequences
Maximum length: 160-200
Cloning: make copies based on a DNA template
Biological reactions
Requires template
Many copies of a long DNA in a short time

82
in vivo DNA Cloning

Connect a piece of DNA to bacterial DNA, which
can then be replicated together with the host DNA

bacterial DNA
83
in vitro DNA Cloning

Polymerase chain reaction (PCR)

[Figure: PCR steps, with the 5' end of each strand marked: denature; primer (< 30 bases); DNA polymerase + dNTP]
84
Some terms

Denature: a DNA double strand is separated into
two strands
By raising temperature
Renature: the process by which two denatured DNA
strands re-form a double strand
By cooling down slowly
Hybridization: two heterogeneous DNAs form a
double-stranded DNA
May have mismatches
The rationale behind many molecular biological
techniques, including DNA microarrays

85
DNA sequencing technology

Read out the letters from a DNA sequence

1974, Frederick Sanger
GTGAGGCGCTGC
86
DNA sequencing Basic idea

PCR
primer extension
5'-TTACAGGTCCATACTA ->
3'-AATGTCCAGGTATGATACATAGG-5'
We need to supply A, C, G, T for the synthesis to
continue
Besides A, C, G, T, we add some modified A*, C*,
G*, and T*
Very similar to A, C, G, T in all aspects, except that
the extension will stop if one is used
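
The idea can be sketched in code under some simplifying assumptions that go beyond this slide: every possible stopping point occurs in some copy, each stopped fragment is identified by its length and its last (terminating) base, and reading the fragments from shortest to longest spells out the newly synthesized strand. The function names are hypothetical, and the template and primer are the ones shown above.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def terminated_fragments(template_3to5: str, primer_5to3: str) -> list:
    """List (fragment length, terminating base) for every stopping point.

    The primer must pair with the start of the template (read 3' to 5').
    """
    for p, t in zip(primer_5to3, template_3to5):
        if PAIR[t] != p:
            raise ValueError("primer does not pair with the template")
    fragments = []
    for pos in range(len(primer_5to3), len(template_3to5)):
        # Extension stopped right after adding the base paired with template[pos].
        fragments.append((pos + 1, PAIR[template_3to5[pos]]))
    return fragments


def read_sequence(fragments: list) -> str:
    """Order fragments by length and read off their terminating bases."""
    return "".join(base for _, base in sorted(fragments))


template = "AATGTCCAGGTATGATACATAGG"  # 3' -> 5'
primer = "TTACAGGTCCATACTA"           # 5' -> 3'
fragments = terminated_fragments(template, primer)
print(read_sequence(reversed(fragments)))  # input order does not matter: TGTATCC
```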

87
DNA sequencing, cont
88
DNA sequencing, cont
90
Advances in DNA sequencing

1969: three years to sequence a 115 nt DNA
1979: three years to sequence 1650 nt
1989: one week to sequence 1650 nt
1995: Haemophilus genome sequenced at TIGR -
1,830,138 nt
2000: Human genome - working draft sequence, 3
billion bases
2003: (near) completion of human genome

91
The bioinformatics landmark

Completion of human genome sequencing is a
success made possible by
Advancement in sequencing technology
Speed of computation
Algorithm development in bioinformatics
HGP (Human Genome Project) strategy
Hierarchical sequencing
Estimated 15 years (1990-2005), completed in 13
years
$3 billion
Celera strategy
Whole-genome shotgun sequencing
Three years (1998-2001)
$300 million

92
Now

Over 300 genomes have been sequenced
10^11 - 10^12 nt

93
2007

Genomes of three individual humans were sequenced
James Watson
Craig Venter
TBN Chinese
Cost for sequencing Watson's genome:
$3 million, 2 months
Compared to $3 billion, 13 years for HGP

Sequencing speed has been tremendously improved
High efficiency and relatively low cost make it
possible to sequence the genome of any individual
from any species
What's next?
