Title: Algorithms in Bioinformatics
1Algorithms in Bioinformatics
Instructor: Yao-Ting Huang
Bioinformatics Laboratory, Department of Computer
Science and Information Engineering, National Chung
Cheng University.
2Introduction
- The need for bioinformatics
- The rapid growth of data produced by biologists requires efficient programs to analyze it.
- Replace wet-lab experiments with dry-lab programs to save time and cost.
- Predict and mine knowledge from the data.
- Introduction to classical problems and algorithms in bioinformatics.
- We will mainly focus on problems at the DNA and RNA levels.
3Reference Book
- An Introduction to Bioinformatics Algorithms, by
Neil C. Jones and Pavel A. Pevzner, MIT Press,
2004.
4Reference Book
- Computational Molecular Biology: An Algorithmic Approach, by
Pavel A. Pevzner, MIT Press, 2000.
5Grading Policy
- Homework assignments and class participation (15%)
- Two midterm exams (30% each)
- November 14, 2007.
- December 19, 2007 (tentative)
- Mostly from the lectures.
- Presentation of selected papers (25%).
6Selection of papers
- Possible sources (to be announced later)
- Journals
- Bioinformatics, Journal of Computational Biology, BMC Bioinformatics, Genome Research, Nature, Nature Genetics
- Conferences
- RECOMB, ISMB, PSB, ECCB, WABI, etc.
7Teaching Information
- Instructor
- Yao-Ting Huang (ythuang_at_cs.ccu.edu.tw)
- Office: 511
- Office hours: 10 AM to 12 noon, Thursday.
8Prerequisites
- Basic knowledge of biology is welcome but not required.
- Basic concepts in algorithms
- Basic data structures
- Big-O notation
9Syllabus
- Alignment Algorithms
- Genome Annotation
- Single Nucleotide Polymorphisms
- Copy Number Polymorphisms
- Linkage Disequilibrium and recombination hotspots
- Evolutionary Analysis
- Statistical tests and simulation
- Other Selected Topics
10A Brief History of Genetics
11The Origin of Species
- At the age of 50, Charles Darwin published the book On the Origin of Species (1859).
- Populations evolve over the course of generations through a process of natural selection.
12Double Helix
- Discovered by Watson and Crick, Nature, 1953.
- 900 words, 2 pages
13Human Genome Project
- 1977
- Maxam and Gilbert, and Frederick Sanger, independently develop methods for sequencing DNA.
- 1990
- The Human Genome Project (HGP) begins and is expected to take 15 years.
- The HGP consortium includes geneticists in China, France, Germany, Japan, and the United Kingdom.
- 1998
- The company Celera announces that it will sequence the human genome within 3 years for $300 million.
- In response, the Wellcome Trust doubles its support for the HGP to $330 million.
14Human Genome Project
- 2000
- At a White House ceremony, HGP and Celera jointly announce working drafts of the human genome sequence and declare their feud at an end.
- 2001
- The HGP consortium publishes its draft genome in Nature (15 February), and Celera publishes its genome in Science (16 February).
- 2003
- The complete human genome is released, two years ahead of schedule.
15Genome Assembly
- The HGP human genome was assembled hours ahead of Celera's.
- Jim Kent wrote the program that assembled the HGP genome and ran it on a cluster of Linux PCs.
- His efforts ensured that the human genome was not patented by Celera.
16Other Sequencing Projects
- 2002 Mouse genome
- 2002 Rice genome
- 2004 Rat genome
- 2005 Chimpanzee genome
- http://www.ensembl.org/index.html
17The Primate Family Tree
Source: Nature
18Comparison of Genomes of Different Species
- Comparing the genomes of distinct species can reveal their evolutionary history.
- E.g., which genes are newly derived and which have been present for a long time?
- The chimpanzee genome differs from the human genome by only about 1%.
19Functions of the Human Genome
- The ENCyclopedia Of DNA Elements (ENCODE) Project
aims to identify all functional elements in the
human genome sequence (Nature, 2007).
20Discovery of Genetic Variants in Human Populations
- The HapMap project collected 269 samples from people of African, European, and Asian descent to identify genetic variants among individuals.
- 90 Yoruba (African), 90 of European descent, 45 Chinese, and 44 Japanese.
21Discovery of Genetic Variants
- The DNA of two individuals differs by less than 0.1%.
- Hinds et al. identified 1,586,383 Single Nucleotide Polymorphisms across three human populations (Science, 2005).
22Perspective of Computer Scientists
- Formulate biological problems as combinatorial problems.
- Develop efficient algorithms to solve them exactly or approximately.
- Design simulation methods to validate the experimental results.
23Introduction to Some Biology Background
24DNA The Code of Life
- Four bases: Adenine, Guanine, Thymine, and Cytosine.
- They pair A-T and C-G on complementary strands.
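The pairing rule above (A-T, C-G) is enough to derive one strand from the other. A minimal Python sketch, assuming the input uses only the letters A, C, G, and T; the function name reverse_complement is chosen here for illustration and does not come from the slides:

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in its own 5'->3' direction."""
    # Complement every base, then reverse, because the two strands run antiparallel.
    return strand.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("AGCCTACG"))  # CGTAGGCT
```

The reversal reflects the fact that the two strands are read in opposite directions; dropping the [::-1] gives the plain base-by-base complement instead.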
25Structure of DNA
- DNA has four bases
- Adenine (A), Guanine (G), Cytosine (C), Thymine (T)
- A and G are purines, whereas C and T are pyrimidines.
- Purines are double-ring bases.
- Pyrimidines are single-ring bases.
[Figure: chemical structures of Cytosine, Thymine, Adenine, and Guanine]
26DNA A Huge Amount of Information
- The human genome is about 3 billion base pairs long.
- Which portion is functional?
AGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTA
AGAGCTATGCCATTTATATAAAATTTAAAGGCGCGGGGGGTTAAGAGGCG
CGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAAGC
TATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAA
AATTTAAAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTT
AAAGCGTAAGAGCTATGCCATTTATATACAATCTAAAGTTAAAGCGTAAG
AGCTATGCAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTA
AAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGAGGCGCGGGGGGT
TAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATT
TATATAAAATTTAAAGAAAGCGTAAGAGCTATGCCATTTATATAAAATTT
AAAGGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCG
TAAGAGCTATGCCATTTATATAAAATTTAAAGGGCGCGGGGGGTTAAGAG
CTATGCCATTAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATT
TAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGTATAAAATTTA
AAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCG
TAAGAGCTATGCCATTTATATAAAATTTAAAGAAAGCGTAAGAGCTATGC
CATTTATATAAAATTTAAAGGGCGCGGGGGGTTAAGAGCTATGCCATTTA
TATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGGG
CGCGGGGGGTTAAGAGCTATGCCATTTAAAATTTAAAGGCGCGGGGGGTT
AAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCG
TAAGAAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCAT
TTATATAAAATTTAAAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATA
TAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGTTAA
AGCGTAAGAGCTATGCAGGCGCGGGGAGCTGGGTTTATATAAAATTTA
27Genes in the Human Genome
- A gene is a set of segments of DNA necessary to produce a functional RNA product.
- Genes make up only a portion of the human genome.
- The gene density of the human genome is roughly 12-15 genes/Mb.
- It is estimated that the human genome contains about 20,000-25,000 genes.
AGCCTACGAATAACCCCTACGAATACATATG
28Annotation of Human Genes
- People have been developing software to annotate the human genome.
- Align protein or RNA sequences to the genome.
- GC-content is positively correlated with gene length and gene density.
- E.g., identification of GC-rich islands is a computational problem.
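One simple way to look for GC-rich regions is to slide a fixed-size window along the sequence and report the windows whose GC fraction is high. The Python sketch below is only an illustration: the window size and threshold are arbitrary assumptions, and real island finders use additional criteria that the slides do not describe.

```python
def gc_content(seq: str) -> float:
    """Fraction of the bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_rich_windows(seq: str, window: int = 10, threshold: float = 0.6):
    """Yield (start, gc) for every window whose GC fraction is at least threshold.

    window and threshold are illustrative defaults, not values from the lecture.
    """
    for start in range(len(seq) - window + 1):
        gc = gc_content(seq[start:start + window])
        if gc >= threshold:
            yield start, gc

# Example: the only all-GC window of length 6 starts at index 4.
print(list(gc_rich_windows("ATATGCGCGCATAT", window=6, threshold=1.0)))
```

This version recomputes each window from scratch, which costs O(n * window); keeping a running count as the window moves would bring it down to O(n).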
29Central Dogma of Molecular Biology
- Transcription
- The DNA codes for the production of messenger RNA (mRNA).
- Translation
- The mRNA is used for protein synthesis.
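At the sequence level, transcription can be sketched as a string operation. The Python snippet below assumes the input is the coding (sense) strand, so the mRNA has the same sequence with U in place of T; the function name transcribe is illustrative only.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding (sense) strand.

    Assumption: the input is the coding strand, so the mRNA reads the same
    as the DNA except that thymine (T) is replaced by uracil (U).
    """
    return coding_strand.upper().replace("T", "U")

print(transcribe("AGCCTACGAATAACCCCTACGAATACATATG"))
# AGCCUACGAAUAACCCCUACGAAUACAUAUG
```

If the template (antisense) strand were given instead, it would first have to be reverse-complemented.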
30Gene
UTR
31Gene
32Alternative Splicing
- Alternative splicing lets one gene produce several different mRNAs, which contributes to the diversity of proteins.
33Research Issues
- Years ago, the major research topics mainly focused on the annotation of human genes.
- Comparative methods often rely on alignment algorithms (e.g., BLAST).
- The RNA or protein sequences are aligned to the human genome.
- Recently the focus has shifted to discovering alternative splicing events.
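To give a flavor of what an alignment algorithm computes, here is a minimal Python sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). It is not how BLAST works, since BLAST is a heuristic local-alignment tool; the scoring values (match = 1, mismatch = -1, gap = -1) are arbitrary assumptions for illustration.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a simple linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))  # 2: three matches and one gap
```

Only two rows of the table are kept, so memory is O(len(b)); recovering the alignment itself would require keeping the full table or traceback pointers.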
34Translation
- The mRNA is translated into amino acids to form a
protein.
AGCCUACGAAUAACCCCUACGAAUACAUAUGCUACGAAUAACCCCUACG
35Translation
- Amino acids are specified by codons: each group of three consecutive mRNA nucleotides encodes one amino acid.
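The codon-to-amino-acid mapping can be sketched in Python as a table lookup. The snippet assumes the standard genetic code, reads the mRNA from its first base in frame, and stops at the first stop codon; it does not search for a start codon. The names CODON_TABLE and translate are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order for each
# codon position. One-letter amino acid codes; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):  # ignore an incomplete trailing codon
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCUAA"))  # MA  (AUG = Met, GCC = Ala, UAA = stop)
```

A character other than A, C, G, or U raises a KeyError here, which is acceptable for a sketch but would need handling for real data.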
36RNAi
- Nobel Prize in Physiology or Medicine 2006, awarded to Andrew Fire and Craig Mello "for their discovery of RNA
interference - gene silencing by double-stranded
RNA"
37Central Dogma of Molecular Biology
38Remarks
- In 2007, the ENCODE project reported that the human genome is pervasively transcribed (Nature, 2007).
- The majority of nucleotides can be found in primary transcripts, including non-protein-coding transcripts.
- Some other factors might still affect the expression of genes or protein synthesis.
- Post-translational modification.
- Epigenomic factors.
39Genetic Variants in the Population
40Genetic Variants
- Genetic variants differ among members of the human population.
DNA sequences of 6 individuals:
Individual  Eye color  DNA sequence
1           Black      GATATTCGTACGGA-T
2           Brown      GATGTTCGTACTGAAT
3           Black      GATATTCGTACGGA-T
4           Blue       GATATTCGTACGGAAT
5           Brown      GATGTTCGTACTGAAT
6           Brown      GATGTTCGTACTGAAT
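The variant positions in the six sequences above can be found by scanning the alignment column by column. A minimal Python sketch, assuming the sequences are already aligned to the same length with "-" marking a deleted base; the function name variable_sites is illustrative:

```python
def variable_sites(sequences):
    """Return the 0-based columns at which the aligned sequences are not all identical."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) walks the alignment one column at a time.
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]

individuals = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]
print(variable_sites(individuals))  # [3, 11, 14]
```

Columns 3 and 11 are single-nucleotide differences (A/G and G/T), while column 14 is a one-base insertion/deletion.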
41Genetic Variants Over Time
[Figure: mutations accumulate over time, from a common ancestor to the variants observed in the present population]
42Point Mutation
- A point mutation is caused by chemicals or by errors in DNA replication and exchanges a single nucleotide for another.
- A transition exchanges a purine for a purine or a pyrimidine for a pyrimidine.
- e.g., C -> T or A -> G.
- A transversion exchanges a purine for a pyrimidine or a pyrimidine for a purine.
- e.g., A -> C or T -> G.
[Figure: substitutions among the bases T, A, G, and C]
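The transition/transversion distinction reduces to checking whether both bases belong to the same chemical class. A minimal Python sketch, assuming single-letter DNA bases; the function name classify_substitution is illustrative:

```python
PURINES = {"A", "G"}      # double-ring bases
PYRIMIDINES = {"C", "T"}  # single-ring bases

def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change ref -> alt as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(classify_substitution("C", "T"))  # transition
print(classify_substitution("A", "C"))  # transversion
```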
43Other Genetic Variants
- Insertion/deletion.
- A portion of DNA is inserted or deleted compared with the reference genome.
- Inversion.
- A portion of DNA is inverted with respect to the reference genome.
- Copy number polymorphisms.
- Duplication and deletion of a segment of DNA.
44Microarray Technology
- Thanks to advances in array technology, tens of thousands of genetic variants can be identified at an affordable cost.
- Affymetrix 500k SNP array.
- Tiling array.
45Association Study
- Studying genetic variants in cases and controls can help identify the variants associated with a disease.
          S1 S2 S3 S4 S5 S6 S7 S8
Cases      1  0  1  1  0  x  0  0
Cases      1  1  0  1  1  2  x  0
Cases      0  2  1  1  0  0  1  0
Controls   0  1  1  0  0  1  0  0
Controls   0  0  x  0  1  0  1  1
Controls   0  2  1  0  0  x  0  1
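A first step in such a study is to compare allele frequencies between the two groups at each site. The Python sketch below rests on assumptions the slide does not state: that the first three rows are cases and the last three are controls, that 0/1/2 count copies of the minor allele in a diploid genotype, and that x marks a missing genotype (written as None). A real study would follow this with a statistical test and far more samples.

```python
# Genotype matrix from the slide; None stands for a missing genotype ("x").
# Row grouping and the 0/1/2 allele-count encoding are assumptions.
cases = [
    [1, 0, 1, 1, 0, None, 0, 0],
    [1, 1, 0, 1, 1, 2, None, 0],
    [0, 2, 1, 1, 0, 0, 1, 0],
]
controls = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, None, 0, 1, 0, 1, 1],
    [0, 2, 1, 0, 0, None, 0, 1],
]

def allele_frequency(genotypes, site):
    """Minor-allele frequency at one site, ignoring missing genotypes."""
    observed = [row[site] for row in genotypes if row[site] is not None]
    if not observed:
        return None
    # Each diploid individual contributes two alleles.
    return sum(observed) / (2 * len(observed))

for site in range(8):
    f_case = allele_frequency(cases, site)
    f_ctrl = allele_frequency(controls, site)
    print(f"S{site + 1}: cases={f_case:.2f} controls={f_ctrl:.2f}")
```

Sites where the two frequencies differ strongly would be candidates for follow-up, subject to a proper significance test.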
46Recombination
- Chromosomal recombination breaks up and
reorganizes the DNA.
47NCBI
- National Center for Biotechnology Information
- http://www.ncbi.nlm.nih.gov/
48Public Database
- UCSC Genome Browser
- http://genome.ucsc.edu/
49Public Database
- Ensembl
- http://www.ensembl.org/