1
Algorithms in Bioinformatics
Instructor: Yao-Ting Huang
Bioinformatics Laboratory, Department of Computer
Science and Information Engineering, National Chung
Cheng University.
2
Introduction
  • The need for bioinformatics:
  • The rapid growth of data produced by biologists
    requires efficient programs to analyze it.
  • Dry-lab programs can replace wet-lab experiments
    to save time and cost.
  • Prediction and knowledge mining from the data.
  • Introduction to classical problems and algorithms
    in bioinformatics.
  • We will mainly focus on problems at the DNA and RNA
    levels.

3
Reference Book
  • An Introduction to Bioinformatics Algorithms, by
    Neil C. Jones and Pavel A. Pevzner, MIT Press,
    2004.

4
Reference Book
  • Computational Molecular Biology: An Algorithmic
    Approach, by Pavel A. Pevzner, MIT Press, 2000.

5
Grading Policy
  • Homework assignments and class participation
    (15%)
  • Two midterm exams (30% each)
  • November 14, 2007.
  • December 19, 2007 (tentative)
  • Mostly from the lectures.
  • Presentation of selected papers (25%).

6
Selection of papers
  • Possible sources (to be announced later)
  • Journals:
  • Bioinformatics, Journal of Computational Biology,
    BMC Bioinformatics, Genome Research, Nature,
    Nature Genetics
  • Conferences:
  • RECOMB, ISMB, PSB, ECCB, WABI, ...

7
Teaching Information
  • Instructor:
  • Yao-Ting Huang, ythuang_at_cs.ccu.edu.tw
  • Office: 511
  • Office hours: 10 AM to 12 noon, Thursday.

8
Prerequisites
  • Basic knowledge of biology is welcome but not
    required.
  • Basic concepts in algorithms:
  • Basic data structures
  • Big-O notation

9
Syllabus
  • Alignment Algorithms
  • Genome Annotation
  • Single Nucleotide Polymorphisms
  • Copy Number Polymorphisms
  • Linkage Disequilibrium and recombination
    hotspots
  • Evolutionary Analysis
  • Statistical tests and simulation
  • Other Selected Topics

10
A Brief History of Genetics
11
The Origin of Species
  • At the age of 50, Charles Darwin published the
    book On the Origin of Species (1859).
  • Populations evolve over the course of generations
    through a process of natural selection.

12
Double Helix
  • Discovered by Watson and Crick, Nature, 1953.
  • 900 words, 2 pages

13
Human Genome Project
  • 1977
  • Maxam and Gilbert and Frederick Sanger
    independently develop methods for sequencing DNA.
  • 1990
  • The Human Genome Project (HGP) began and was
    expected to take 15 years.
  • The HGP consortium included geneticists in China,
    France, Germany, Japan, and the United Kingdom.
  • 1998
  • The company Celera claims it will sequence the human
    genome within 3 years for $300 million.
  • In response, the Wellcome Trust doubles its
    support for the HGP to £330 million.

14
Human Genome Project
  • 2000
  • At a White House ceremony, HGP and Celera jointly
    announce working drafts of the human genome
    sequence and declare their feud at an end.
  • 2001
  • The HGP consortium publishes its draft genome in
    Nature (15 February), and Celera publishes its
    genome in Science (16 February).
  • 2003
  • The complete human genome is released, 2 years
    ahead of schedule.

15
Genome Assembly
  • The HGP human genome was assembled hours ahead of
    Celera's.
  • Jim Kent wrote the program that assembled the HGP
    genome and ran it on a grid of Linux PCs.
  • His efforts ensured that the human genome was
    not patented by Celera.

16
Other Sequencing Projects
  • 2002: Mouse genome
  • 2002: Rice genome
  • 2004: Rat genome
  • 2005: Chimpanzee genome
  • http://www.ensembl.org/index.html

17
The Primate Family Tree
Source: Nature
18
Comparison of Genomes of Different Species
  • The comparison of genomes of distinct species can
    reveal the evolutionary history.
  • E.g., which genes are newly derived and which have
    been present for a long time?
  • The chimpanzee genome differs from the human genome
    by only about 1%.

19
Functions of the Human Genome
  • The ENCyclopedia Of DNA Elements (ENCODE) Project
    aims to identify all functional elements in the
    human genome sequence (Nature, 2007).

20
Discovery of Genetic Variants in Human Populations
  • The HapMap project collected 269 samples from
    people of African, European, and Asian descent to
    identify genetic variants among individuals.
  • 90 Yoruba (African), 90 of European descent, 45
    Chinese, and 44 Japanese.

21
Discovery of Genetic Variants
  • The DNA of two individuals differs by less than
    0.1%.
  • Hinds et al. identified 1,586,383 Single
    Nucleotide Polymorphisms across three human
    populations (Science, 2005).

22
Perspective of Computer Scientists
  • Formulate biological problems as combinatorial
    problems.
  • Develop efficient algorithms to solve them
    exactly or approximately.
  • Design simulation methods to validate the
    experimental results.

23
Introduction to Some Biology Background
24
DNA: The Code of Life
  • Four code letters: Adenine, Guanine, Thymine, and
    Cytosine.
  • They pair A-T and C-G on complementary strands.
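
The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the names COMPLEMENT and reverse_complement are made up for this example.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
# COMPLEMENT and reverse_complement are illustrative names, not from the slides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("GATTACA"))  # TGTAATC
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.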

25
Structure of DNA
  • DNA has four bases:
  • Adenine (A), Guanine (G), Cytosine (C), Thymine
    (T)
  • A and G are purines, whereas C and T are
    pyrimidines.
  • Purines are double-ring bases.
  • Pyrimidines are single-ring bases.

[Figure: chemical structures of Cytosine, Thymine, Adenine, and Guanine]
26
DNA: A Huge Amount of Information
  • The human genome is about 3 billion bases long.
  • Which portion is functional?

AGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTA
AGAGCTATGCCATTTATATAAAATTTAAAGGCGCGGGGGGTTAAGAGGCG
CGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAAGC
TATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAA
AATTTAAAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTT
AAAGCGTAAGAGCTATGCCATTTATATACAATCTAAAGTTAAAGCGTAAG
AGCTATGCAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTA
AAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGAGGCGCGGGGGGT
TAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATT
TATATAAAATTTAAAGAAAGCGTAAGAGCTATGCCATTTATATAAAATTT
AAAGGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCG
TAAGAGCTATGCCATTTATATAAAATTTAAAGGGCGCGGGGGGTTAAGAG
CTATGCCATTAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATT
TAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGTATAAAATTTA
AAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCG
TAAGAGCTATGCCATTTATATAAAATTTAAAGAAAGCGTAAGAGCTATGC
CATTTATATAAAATTTAAAGGGCGCGGGGGGTTAAGAGCTATGCCATTTA
TATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGGG
CGCGGGGGGTTAAGAGCTATGCCATTTAAAATTTAAAGGCGCGGGGGGTT
AAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCG
TAAGAAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCAT
TTATATAAAATTTAAAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATA
TAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGTTAA
AGCGTAAGAGCTATGCAGGCGCGGGGAGCTGGGTTTATATAAAATTTA
27
Genes in the Human Genome
  • A gene is a set of segments of DNA necessary to
    produce a functional RNA product.
  • Genes make up only a portion of the human genome.
  • The gene density of the human genome is roughly
    12-15 genes/Mb.
  • It is estimated that the human genome contains
    about 20,000-25,000 genes.

AGCCTACGAATAACCCCTACGAATACATATG
28
Annotation of Human Genes
  • Software has been developed to annotate the
    human genome.
  • Align protein or RNA sequences to the genome.
  • GC content is positively correlated with gene
    length and gene density.
  • E.g., identifying GC-rich islands is a
    computational problem.
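
As a rough illustration of why finding GC-rich regions is a computational problem, the sketch below scans a sequence with a sliding window and reports windows whose GC fraction reaches a threshold. The function names, the window size, and the threshold are assumptions chosen for this example; practical island detection uses additional criteria that the slide does not describe.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(seq: str, window: int = 10, threshold: float = 0.6):
    """Yield (start, gc) for every window whose GC fraction meets the threshold.

    window and threshold are illustrative defaults, not values from the slides.
    """
    for start in range(len(seq) - window + 1):
        gc = gc_content(seq[start:start + window])
        if gc >= threshold:
            yield start, gc


print(list(gc_rich_windows("AAAAGGGGCC", window=4, threshold=1.0)))
```

Recomputing each window from scratch costs O(n * window) time; keeping a running count while the window slides would bring this down to O(n), which matters on a genome of billions of bases.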

29
Central Dogma of Molecular Biology
[Figure: transcription (DNA to mRNA) and translation (mRNA to protein)]
  • Transcription
  • The DNA codes for the production of messenger RNA
    (mRNA).
  • Translation
  • The mRNA is used for protein synthesis.

30
Gene
UTR
31
Gene
32
Alternative Splicing
  • The diversity of proteins is due to alternative
    splicing.

33
Research Issues
  • Years ago, the major research topics mainly focused
    on the annotation of human genes.
  • Comparative methods often rely on alignment
    algorithms (e.g., BLAST).
  • The RNA or protein sequences are aligned to the
    human genome.
  • Recently the focus has shifted to discovering
    alternative splicing events.
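
To give a flavor of what an alignment algorithm computes, here is a minimal sketch of global alignment scoring by dynamic programming (the classical Needleman-Wunsch recurrence). This is not BLAST, which is a faster heuristic; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration only.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch recurrence)."""
    # At the start of row i, prev[j] is the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GAT", "GT"))  # 1
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the running time is quadratic. That cost is the reason heuristics such as BLAST are used when aligning against a whole genome.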

34
Translation
  • The mRNA is translated into amino acids to form a
    protein.

AGCCUACGAAUAACCCCUACGAAUACAUAUGCUACGAAUAACCCCUACG
35
Translation
  • Each amino acid is specified by a codon (three
    consecutive nucleotides).
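
Transcription and translation can be sketched as string operations. The example below builds the standard genetic code table and translates from the first AUG to the first stop codon. The function names, and the choice to start at the first AUG in a single reading frame, are simplifying assumptions for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed for codons TTT, TTC, TTA, TTG, TCT, ..., GGG.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    seq = mrna.upper().replace("U", "T")  # the table is keyed by DNA letters
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))  # MA
```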

36
RNAi
  • Nobel Prize 2006, "for their discovery of RNA
    interference - gene silencing by double-stranded
    RNA"

37
Central Dogma of Molecular Biology
38
Remarks
  • In 2007, the ENCODE project reported that the
    human genome is pervasively transcribed (Nature,
    2007).
  • The majority of nucleotides can be found in
    primary transcripts, including non-protein-coding
    transcripts.
  • Some other factors might still affect the
    expression of genes or protein synthesis.
  • Post-translational modification.
  • Epigenomic factors.

39
Genetic Variants in the Population
40
Genetic Variants
  • Genetic variants differ among members of the
    human population.

Black eye   GATATTCGTACGGA-T
Brown eye   GATGTTCGTACTGAAT
Black eye   GATATTCGTACGGA-T
Blue eye    GATATTCGTACGGAAT
Brown eye   GATGTTCGTACTGAAT
Brown eye   GATGTTCGTACTGAAT
DNA sequences of 6 individuals
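
Given aligned sequences such as the six on this slide, the variable positions can be found by comparing columns. The sketch below is a minimal illustration; the function name variant_sites is made up for this example, and a gap ('-') is simply treated as another symbol.

```python
def variant_sites(aligned):
    """Return 0-based column indices where the aligned sequences are not all identical."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("sequences must have equal (aligned) length")
    return [i for i, column in enumerate(zip(*aligned)) if len(set(column)) > 1]


# The six aligned sequences from the slide ('-' marks a gap).
individuals = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]
print(variant_sites(individuals))  # [3, 11, 14]
```

Columns 3 and 11 are single-nucleotide differences (A/G and G/T), while column 14 is an insertion/deletion.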
41
Genetic Variants Over Time
[Figure: mutations accumulating over time, from a common ancestor to the variants observed in the present population]
42
Point Mutation
  • A point mutation is caused by chemicals or errors in
    DNA replication and exchanges a single nucleotide
    for another.
  • A transition exchanges a purine for a purine or a
    pyrimidine for a pyrimidine.
  • E.g., C -> T or A -> G.
  • A transversion exchanges a purine for a pyrimidine
    or a pyrimidine for a purine.
  • E.g., A -> C or T -> G.
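
The two definitions translate directly into a set-membership test. The following is a small sketch; the function name classify_substitution is hypothetical.

```python
PURINES = {"A", "G"}      # double-ring bases
PYRIMIDINES = {"C", "T"}  # single-ring bases


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change as a 'transition' or a 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("C", "T"))  # transition
print(classify_substitution("A", "C"))  # transversion
```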

[Figure: transitions and transversions among the bases T, A, G, and C]
43
Other Genetic Variants
  • Insertion/deletion.
  • A portion of DNA is inserted or deleted compared
    with the reference genome.
  • Inversion.
  • A portion of DNA is inverted with respect to the
    reference genome.
  • Copy number polymorphisms.
  • Duplication and deletion of a segment of DNA.

44
Microarray Technology
  • Thanks to advances in array technology, tens
    of thousands of genetic variants can be
    identified at an affordable cost.
  • Affymetrix 500k SNP array.
  • Tiling array.

45
Association Study
  • Comparing genetic variants between cases and
    controls can help identify the causes of disease.

           S1 S2 S3 S4 S5 S6 S7 S8
Cases       1  0  1  1  0  x  0  0
            1  1  0  1  1  2  x  0
            0  2  1  1  0  0  1  0
Controls    0  1  1  0  0  1  0  0
            0  0  x  0  1  0  1  1
            0  2  1  0  0  x  0  1
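
One common way to compare cases and controls at a single site is a chi-square test on a 2x2 table of allele counts. The slide does not name a test, so this is a standard technique shown only as a sketch. It assumes that genotypes 0/1/2 count copies of one allele and that 'x' marks a missing call; both function names are hypothetical.

```python
def allele_counts(genotypes):
    """Return (counted allele, other allele) totals from genotypes coded 0/1/2.

    Assumption: each code is the number of copies of one allele in a diploid
    individual, and 'x' marks a missing genotype that is skipped.
    """
    called = [int(g) for g in genotypes if g != "x"]
    counted = sum(called)
    return counted, 2 * len(called) - counted


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a nonzero total")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical genotypes at one site: rows are cases and controls.
case_counts = allele_counts(["2", "1", "2", "x"])
control_counts = allele_counts(["0", "1", "0", "0"])
print(chi_square_2x2(*case_counts, *control_counts))
```

A larger statistic indicates a stronger difference in allele frequency between the two groups. With samples as small as those on the slide, an exact test would be more appropriate than the chi-square approximation.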
46
Recombination
  • Chromosomal recombination breaks up and
    reorganizes the DNA.
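
A single crossover between two haplotypes can be modeled as swapping their tails at a breakpoint. This is a deliberately simplified sketch with a hypothetical function name; real recombination can involve multiple breakpoints with position-dependent rates.

```python
def crossover(hap1: str, hap2: str, point: int):
    """Swap the tails of two equal-length haplotypes from index `point` onward."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same length")
    if not 0 <= point <= len(hap1):
        raise ValueError("breakpoint must lie within the haplotype")
    return hap1[:point] + hap2[point:], hap2[:point] + hap1[point:]


print(crossover("AAAA", "TTTT", 2))  # ('AATT', 'TTAA')
```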

47
NCBI
  • National Center for Biotechnology Information
  • http://www.ncbi.nlm.nih.gov/

48
Public Database
  • UCSC Genome Browser
  • http://genome.ucsc.edu/

49
Public Database
  • Ensembl
  • http://www.ensembl.org/