Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
260.602.01 September 1, 2006
Jonathan Pevsner, Ph.D. pevsner_at_kennedykrieger.org
2Teaching assistants
Hugh Cahill (hugh_at_jhu.edu)
Jennifer Turney (jturney_at_jhsph.edu)
Meg Zupancic (mzupanc1_at_jhmi.edu)
3Who is taking this course?
- People with very diverse backgrounds in biology
- People with diverse backgrounds in computer science and biostatistics
- Most people have a favorite gene, protein, or disease
4What are the goals of the course?
- To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
- To focus on the analysis of DNA, RNA and proteins
- To introduce you to the analysis of genomes
- To combine theory and practice to help you solve research problems
5Themes throughout the course
- Textbooks
- Web sites
- Literature references
- Gene/protein families
- Computer labs
6Textbook
The course textbook is J. Pevsner, Bioinformatics
and Functional Genomics (Wiley, 2003). The
chapters contain content, lab exercises, and
quizzes that were developed in this course over
the past six years. A few copies will be
available on reserve at Welch Library for those
of you who do not want to buy a copy (go up to
the 2nd floor), and the library has six more
copies. Several other bioinformatics texts are
available: Baxevanis and Ouellette; David
Mount; Durbin et al.
7Web sites
The course website is reached via
http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm
(or Google pevsnerlab → courses). This site
contains the powerpoints for each lecture.
The textbook website is http://www.bioinfbook.org
This has 1000 URLs, organized by chapter. This
site also contains the same powerpoints.
The weekly quizzes are on my website:
http://pevsnerlab.kennedykrieger.org/moodle
Once you log in and take a quiz, you will get
instant feedback. You can use moodle to ask
questions as well.
8Literature references
You are encouraged to read original
source articles. They will enhance your
understanding of the material. Reading will be
assigned.
9Themes throughout the course: gene/protein
families
We will use retinol-binding protein 4 (RBP4) as a
model gene/protein throughout the course. RBP4 is
a member of the lipocalin family. It is a small,
abundant carrier protein. We will study it in a
variety of contexts including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
We will also use other examples, such as
the globins and the pol protein of HIV-1.
11The HIV-1 pol gene encodes three proteins
Aspartyl protease (PR)
Reverse transcriptase (RT)
Integrase (IN)
12Themes throughout the course: computer labs
There is a computer lab each Friday. This is a
chance to gain practical experience using a
variety of web resources. You can do the lab on
your own, ahead of time. However, during the
Friday lab you can get help on problems, and in
some cases the computers will have specialized
software.
13Grading
40%: ten moodle quizzes (corresponding to chapters 2-11)
30%: final exam, October 25 (in class)
30%: discovery of a novel gene
--Find the novel gene by the end of September, and turn in the final report, with phylogenetic tree, by October 25
--Instructions are posted on the course website
--We will discuss this project in detail in the next two weeks.
14Grading
Quizzes are taken at the moodle website, and are
due one week after the relevant lecture.
Ten quizzes:
4% Chapter 2 quiz (sequences)
4% Chapter 3 quiz (alignment)
4% Chapter 4 quiz (BLAST)
4% Chapter 5 quiz (advanced BLAST)
4% Chapter 6 quiz (RNA)
4% Chapter 7 quiz (microarrays)
4% Chapter 8 quiz (proteomics)
4% Chapter 9 quiz (protein structure)
4% Chapter 10 quiz (multiple alignment)
4% Chapter 11 quiz (phylogeny)
30% find-a-gene project (due October 25)
30% final exam, October 25 (in class)
15Outline for today (chapters 1 and 2)
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Four ways to find information on proteins and DNA
Access to biomedical literature
16What is bioinformatics?
- Interface of biology and computers
- Analysis of proteins, genes and genomes using computer algorithms and computer databases
- Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
17Top ten challenges for bioinformatics
1 Precise models of where and when transcription will occur in a genome (initiation and termination)
2 Precise, predictive models of alternative RNA splicing
3 Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
4 Determining protein-DNA, protein-RNA, protein-protein recognition codes
5 Accurate ab initio protein structure prediction
18Top ten challenges for bioinformatics
6 Rational design of small molecule inhibitors of proteins
7 Mechanistic understanding of protein evolution
8 Mechanistic understanding of speciation
9 Development of effective gene ontologies: systematic ways to describe gene and protein function
10 Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett
19On bioinformatics
Science is about building causal relations
between natural phenomena (for instance, between
a mutation in a gene and a disease). The
development of instruments to increase our
capacity to observe natural phenomena has,
therefore, played a crucial role in the
development of science - the microscope being the
paradigmatic example in biology. With the human
genome, the natural world takes an unprecedented
turn: it is better described as a sequence of
symbols. Besides high-throughput machines such as
sequencers and DNA chip readers, the computer and
the associated software becomes the instrument to
observe it, and the discipline of bioinformatics
flourishes.
20On bioinformatics
However, as the separation between us (the
observers) and the phenomena observed increases
(from organism to cell to genome, for instance),
instruments may capture phenomena only
indirectly, through the footprints they leave.
Instruments therefore need to be calibrated: the
distance between the reality and the observation
(through the instrument) needs to be accounted
for. This issue of Genome Biology is about
calibrating instruments to observe gene
sequences; more specifically, computer programs
to identify human genes in the sequence of the
human genome.
Martin Reese and Roderic Guigó,
Genome Biology 2006, 7(Suppl I):S1, introducing
EGASP, the Encyclopedia of DNA Elements (ENCODE)
Genome Annotation Assessment Project
21Tool-users
Tool-makers
22Three perspectives on bioinformatics
The cell
The organism
The tree of life
Page 4
24DNA
RNA
phenotype
protein
Page 5
25Time of development
Body region, physiology, pharmacology, pathology
Page 5
26After Pace NR (1997) Science 276:734
Page 6
27DNA
RNA
phenotype
protein
28Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) plotted by year, 1982 to 2002]
Fig. 2.1 Page 17
Updated 8-12-04: >40b base pairs
29Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions), December 1982 to June 2006; vertical axis from 0 to 70]
30Growth of the International Nucleotide Sequence
Database Collaboration
Base pairs of DNA (billions)
Base pairs contributed by GenBank, EMBL, and DDBJ
http://www.ncbi.nlm.nih.gov/Genbank/
31Central dogma of molecular biology
DNA → RNA → protein
32DNA
RNA
phenotype
protein
protein sequence databases
cDNA ESTs UniGene
genomic DNA databases
Fig. 2.2 Page 20
33There are three major public DNA databases
GenBank
EMBL
DDBJ
The underlying raw DNA sequences are identical
Page 16
34There are three major public DNA databases
GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan
Page 16
35>100,000 species are represented in GenBank
all species 128,941
viruses 6,137
bacteria 31,262
archaea 2,100
eukaryota 87,147
Table 2-1 Page 17
36Taxonomy nodes at NCBI
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
8/06
37The most sequenced organisms in GenBank
Homo sapiens 10.7 billion bases
Mus musculus 6.5b
Rattus norvegicus 5.6b
Danio rerio 1.7b
Zea mays 1.4b
Oryza sativa 0.8b
Drosophila melanogaster 0.7b
Gallus gallus 0.5b
Arabidopsis thaliana 0.5b
Table 2-2 Page 18
Updated 8-12-04 GenBank release 142.0
38The most sequenced organisms in GenBank
Homo sapiens 11.2 billion bases
Mus musculus 7.5b
Rattus norvegicus 5.7b
Danio rerio 2.1b
Bos taurus 1.9b
Zea mays 1.4b
Oryza sativa (japonica) 1.2b
Xenopus tropicalis 0.9b
Canis familiaris 0.8b
Drosophila melanogaster 0.7b
Table 2-2 Page 18
Updated 8-29-05 GenBank release 149.0
39The most sequenced organisms in GenBank
Homo sapiens 12.3 billion bases
Mus musculus 8.0b
Rattus norvegicus 5.7b
Bos taurus 3.5b
Danio rerio 2.5b
Zea mays 1.8b
Oryza sativa (japonica) 1.5b
Strongylocentrotus purpuratus 1.2b
Sus scrofa 1.0b
Xenopus tropicalis 1.0b
Table 2-2 Page 18
Updated 7-19-06 GenBank release 154.0
40National Center for Biotechnology Information
(NCBI) www.ncbi.nlm.nih.gov
Page 24
41Fig. 2.5 Page 25
www.ncbi.nlm.nih.gov
42Fig. 2.5 Page 25
43- PubMed is
- National Library of Medicine's search service
- 16 million citations in MEDLINE
- links to participating online journals
- PubMed tutorial (via Education on side bar)
Page 24
44- Entrez integrates
- the scientific literature
- DNA and protein sequence databases
- 3D protein structure data
- population study data sets
- assemblies of complete genomes
Page 24
45Entrez is a search and retrieval system that
integrates NCBI databases
Page 24
46- BLAST is
- Basic Local Alignment Search Tool
- NCBI's sequence similarity search tool
- supports analysis of DNA and protein databases
- 100,000 searches per day
Page 25
47- OMIM is
- Online Mendelian Inheritance in Man
- catalog of human genes and genetic disorders
- edited by Dr. Victor McKusick, others at JHU
Page 25
48- Books is
- searchable resource of on-line books
Page 26
49- TaxBrowser is
- browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
- taxonomy information such as genetic codes
- molecular data on extinct organisms
Page 26
50- Structure site includes
- Molecular Modelling Database (MMDB)
- biopolymer structures obtained from the Protein Data Bank (PDB)
- Cn3D (a 3D-structure viewer)
- vector alignment search tool (VAST)
Page 26
51Accessing information on molecular sequences
Page 26
52Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that
contain information on DNA, RNA, or protein
sequences. You may want to acquire information
beginning with a query such as the name of a
protein of interest, or the raw nucleotides
comprising a DNA sequence of interest. DNA
sequences and other molecular data are tagged
with accession numbers that are used to identify
a sequence or other record relevant to molecular
data.
Page 26
53What is an accession number?
An accession number is a label that is used to
identify a sequence. It is a string of letters
and/or numbers that corresponds to a molecular
sequence. Examples (all for retinol-binding
protein, RBP4):
DNA
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)
RNA
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)
protein
NP_006735 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record
Page 27
54Four ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
Note: LocusLink at NCBI was recently retired. The
third printing of the book has updated these
sections (pages 27-31).
Page 27
55Four ways to access protein and DNA sequences
1 Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects
key information on each gene/protein from major
databases. It covers all major organisms. RefSeq
provides a curated, optimal accession number for
each DNA (NM_006744) or protein (NP_006735).
Page 27
56From the NCBI home page, type rbp4 and hit Go
revised Fig. 2.7 Page 29
57revised Fig. 2.7 Page 29
60By applying limits, there are now just two entries
61Entrez Gene (top of page)
Note that links to many other RBP4 database
entries are available
revised Fig. 2.8 Page 30
62Entrez Gene (middle of page)
63Entrez Gene (bottom of page)
64Fig. 2.9 Page 32
65Fig. 2.9 Page 32
66Fig. 2.9 Page 32
67FASTA format
Fig. 2.10 Page 32
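As a rough illustration of what FASTA format looks like to a program (this is not the textbook figure), a record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below assumes that simple layout; the helper name parse_fasta and the example record are hypothetical placeholders, not a real RBP4 sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; drop the leading ">" from the header
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are collected and joined later
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical placeholder record (made-up residues, for illustration only)
example = """>example_protein hypothetical placeholder record
ACDEFGHIKL
MNPQRSTVWY
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

The sequence lines are joined into a single string, so the line wrapping in the file does not matter.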
69NCBI's important RefSeq project: best
representative sequences
RefSeq (accessible via the main page of
NCBI) provides an expertly curated accession
number that corresponds to the most stable,
agreed-upon reference version of a sequence.
RefSeq identifiers include the following
formats:
Complete genome NC_
Complete chromosome NC_
Genomic contig NT_
mRNA (DNA format) NM_ e.g. NM_006744
Protein NP_ e.g. NP_006735
Page 29-30
70NCBI's RefSeq project: accession for genomic,
mRNA, protein sequences
Accession        Molecule  Method         Note
AC_123456        Genomic   Mixed          Alternate complete genomic
AP_123456        Protein   Mixed          Protein products; alternate
NC_123456        Genomic   Mixed          Complete genomic molecules
NG_123456        Genomic   Mixed          Incomplete genomic regions
NM_123456        mRNA      Mixed          Transcript products; mRNA
NM_123456789     mRNA      Mixed          Transcript products; 9-digit
NP_123456        Protein   Mixed          Protein products
NP_123456789     Protein   Curation       Protein products; 9-digit
NR_123456        RNA       Mixed          Non-coding transcripts
NT_123456        Genomic   Automated      Genomic assemblies
NW_123456        Genomic   Automated      Genomic assemblies
NZ_ABCD12345678  Genomic   Automated      Whole genome shotgun data
XM_123456        mRNA      Automated      Transcript products
XP_123456        Protein   Automated      Protein products
XR_123456        RNA       Automated      Transcript products
YP_123456        Protein   Auto. Curated  Protein products
ZP_12345678      Protein   Automated      Protein products
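Because the two-letter prefix before the underscore indicates the molecule type, a RefSeq accession can be sorted by a simple lookup. The following is only a sketch built from the prefixes in the table above; the names REFSEQ_MOLECULE and classify_refseq are hypothetical, and the lookup does not validate the digits after the underscore.

```python
# Molecule type for each RefSeq prefix, copied from the table above
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_MOLECULE.get(prefix.upper())


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, classify_refseq(acc))
```

A GenBank accession such as X02775 has no underscore, so the sketch returns None for it rather than guessing.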
71Four ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
Page 31
72protein
DNA
RNA
complementary DNA (cDNA)
UniGene
Fig. 2.3 Page 23
73UniGene: unique genes via ESTs
- Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
- UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
- UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
74Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster
size is 1
Fig. 2.3 Page 23
75Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the
cluster size is 10
76Cluster sizes in UniGene (human)
Cluster size (ESTs)   Number of clusters
1                     42,800
2                     6,500
3-4                   6,500
5-8                   5,400
9-16                  4,100
17-32                 3,300
500-1000              2,128
2000-4000             233
8000-16,000           21
16,000-30,000         8
UniGene build 194, 8/06
77UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look
up information about expressed genes.
UniGene displays information about the abundance
of a transcript (expressed gene), as well as its
regional distribution of expression (e.g. brain
vs. liver). We will discuss UniGene further on
September 18 (gene expression).
Page 31
78Four ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
Page 31
79Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a
premier human genome web browser. We will
encounter Ensembl as we study the human
genome, BLAST, and other topics.
80click human
81enter RBP4
83Four ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
Page 33
84ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system (ExPASy: Expert
Protein Analysis System). Visit
http://www.expasy.ch/
Page 33
85Fig. 2.11 Page 33
87Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the
main page of NCBI, and type an Entrez query:
hiv-1 pol
Page 34
89Searching for HIV-1 pol: following the genome
link yields a manageable three results
Page 34
90Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about
40,000 nucleotide or protein records (and
>100,000 records for a search for hiv-1), but
these can be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
Page 34
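The two narrowing steps can also be written as a single Entrez query string. The sketch below is an assumption-laden illustration, not the procedure shown in class (which uses the Limits page on the web site): it assumes NCBI's E-utilities esearch endpoint and the srcdb_refseq[prop] filter as the programmatic way to restrict results to RefSeq, and the helper names build_query and esearch_url are hypothetical. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch" (not mentioned on the slides)
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(term, organism=None, refseq_only=False):
    """Combine a search term with optional organism and RefSeq restrictions."""
    parts = [term]
    if organism:
        # Field tag in square brackets, as in hiv-1[organism]
        parts.append(f"{organism}[organism]")
    if refseq_only:
        # Assumed property filter for limiting results to RefSeq records
        parts.append("srcdb_refseq[prop]")
    return " AND ".join(parts)


def esearch_url(query, db="nucleotide"):
    """Return an esearch URL for the query (the request itself is not sent)."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": query})


query = build_query("pol", organism="hiv-1", refseq_only=True)
url = esearch_url(query)
print(query)
print(url)
```

Each restriction is joined with AND, so adding one can only shrink the result set, which mirrors how the record count drops at each step.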
91over 100,000 nucleotide entries for HIV-1
only 1 RefSeq
92Examples of how to access sequence data: histone
query for histone: results
protein records: 21847
RefSeq entries: 7544
RefSeq (limit to human): 1108
NOT deacetylase: 697
At this point, select a reasonable candidate
(e.g. histone 2, H4) and follow its link to
Entrez Gene. There, you can confirm you have the
right gene/protein.
8-12-06
94Access to Biomedical Literature
Page 35
95PubMed at NCBI to find literature information
96PubMed is the NCBI gateway to MEDLINE. MEDLINE
contains bibliographic citations and author
abstracts from over 4,600 journals published in
the United States and in 70 foreign countries.
It has >14 million records dating back to 1966.
Page 35
97MeSH is the acronym for "Medical Subject
Headings." MeSH is the list of the vocabulary
terms used for subject analysis of biomedical
literature at NLM. MeSH vocabulary is used for
indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical
literature.
Page 35
100PubMed search strategies
- Try the tutorial (education on the left sidebar)
- Use boolean queries (capitalize AND, OR, NOT): lipocalin AND disease
- Try using limits
- Try Links to find Entrez information and external resources
- Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
101lipocalin AND disease (60 results)
lipocalin OR disease (1,650,000 results)
lipocalin NOT disease (530 results)
[Venn diagrams: circles 1 (lipocalin) and 2 (disease), shaded for 1 AND 2, 1 OR 2, and 1 NOT 2]
Fig. 2.12 Page 34
8/04
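The three boolean operators behave like set operations on the articles matching each term. A minimal sketch, using invented article IDs rather than real PubMed records:

```python
# Hypothetical article IDs, invented for illustration only
lipocalin = {101, 102, 103, 104}       # articles matching "lipocalin"
disease = {103, 104, 105, 106, 107}    # articles matching "disease"

both = lipocalin & disease             # lipocalin AND disease
either = lipocalin | disease           # lipocalin OR disease
lipocalin_only = lipocalin - disease   # lipocalin NOT disease

print(sorted(both))
print(sorted(either))
print(sorted(lipocalin_only))

# Every lipocalin article is in exactly one of the AND and NOT results
assert len(both) + len(lipocalin_only) == len(lipocalin)
```

The same bookkeeping applies to the counts above: 60 (AND) plus 530 (NOT) gives 590 articles matching lipocalin at the time of that search.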
102Article contents
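Search result (globin is found or not found) compared with article contents (globin is present or absent):
globin is found, globin is present: true positive
globin is found, globin is absent: false positive (article does not discuss globins)
globin is not found, globin is present: false negative (article discusses globins)
globin is not found, globin is absent: true negative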
8/06
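The four outcomes follow from two yes/no questions: did the search return the article, and does the article actually discuss globins? A small sketch of that logic (the function name classify_result is hypothetical):

```python
def classify_result(found_by_search, discusses_globins):
    """Name the outcome for one article in a literature search for globins."""
    if found_by_search and discusses_globins:
        return "true positive"
    if found_by_search and not discusses_globins:
        return "false positive"
    if not found_by_search and discusses_globins:
        return "false negative"
    return "true negative"


# Enumerate all four combinations
for found in (True, False):
    for present in (True, False):
        print(found, present, classify_result(found, present))
```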
103WelchWeb is available at http://www.welch.jhu.edu
104http://www.welch.jhu.edu
Brian Brown (bbrown20_at_jhmi.edu) and Carrie Iwema
(iwema_at_jhmi.edu) are the Welch Medical Library
liaisons to the basic sciences
105Course sponsors: Dept. of Molecular
Microbiology and Immunology, and Dept. of
Biostatistics, School of Public Health