Title: Principles and Applications of Bioinformatics: molecular sequence analysis. BINF 5230
1Advances in Molecular Cellular Genetics
Principles and Applications of Bioinformatics
BINF 5230 Lecture 1, Spring 2005
2Who am I?
- Alexander Kister, Ph.D
- Research in Bioinformatics.
- kisterae_at_umdnj.edu
- Rm 364
- Phone (973) 972 8596
3Who are you?
- First and last name
- University you graduated from, year, major
- Current position
- Interest, if any, in bioinformatics (optional)
- Interest in other fields
- Please send to
- kisterae_at_umdnj.edu
4No single text is required for this course. Reference will be made to materials available in different books and on the internet.
Recommended books:
- Bioinformatics: Sequence and Genome Analysis, by David Mount
- Introduction to Bioinformatics, by A. Lesk
5- Grading
- 35% Assignments/Homework
- 35% Participation in class and oral presentations
- 30% Final exam
6Research Support: a function of Academic Computing Services (ACS) within the University of Medicine and Dentistry of New Jersey
Research software is subdivided into seven categories:
- Molecular Docking: Autodock, UCSF DOCK, Gold
- Molecular Biology: GCG, SeqWeb
- Molecular Dynamics / Electrostatics: Amber, Gromacs, NAMD, Delphi
- Comparative Modeling: Look, Modeller
- Vendor Modeling Suites: Sybyl (Tripos, Inc.), Insight II (Accelrys, Inc.), Gold
- Viewers / Graphics: Rasmol, Deep View, VMD, Molscript, Raster3D
- Statistical Analysis Software: SAS
7Please apply for an Academic Computing Services (ACS) computer account on a UMDNJ campus and for a CGC password.
In person at Newark: MSB Room C632, 2-6789 (973-972-6789)
Academic Computing Services Accounts Policy: http://www.umdnj.edu/istweb/prodserv/acs_acpl.htm
8What is bioinformatics? An introduction and
overview.
9Bioinformatics is a field of science in which
biology,
computer science, and
information technology
merge
into a single discipline. The goal is to reveal
new insights and principles in biology.
10Biological data are flooding in at an unprecedented rate.
Swiss-Prot:
- August 2000: 88,166 entries
- June 2003: 129,463 entries
- January 2004: 143,418 entries
- January 2005: 167,089 entries
Such growth was made possible not just by the development of sophisticated machinery for cloning and sequencing, but also by the arrival of computer programs that can predict gene-coding regions within large genomes, and of programs that reassemble the thousands of short fragments generated during the sequencing of an entire genome.
An experimental laboratory can easily produce over 100 gigabytes of data a day.
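The Swiss-Prot figures above can be turned into an approximate yearly growth rate. The following is a minimal sketch, assuming each count refers to the first day of the quoted month (the slide gives only month and year) and that growth compounds smoothly between data points; the function name is illustrative.

```python
from datetime import date

# Swiss-Prot entry counts quoted on the slide.
# Assumption: the day of the month is not given, so the 1st is used.
counts = [
    (date(2000, 8, 1), 88_166),
    (date(2003, 6, 1), 129_463),
    (date(2004, 1, 1), 143_418),
    (date(2005, 1, 1), 167_089),
]


def annual_growth(start, end):
    """Compound annual growth rate between two (date, count) points."""
    (d0, n0), (d1, n1) = start, end
    years = (d1 - d0).days / 365.25
    return (n1 / n0) ** (1 / years) - 1


overall = annual_growth(counts[0], counts[-1])
last_year = annual_growth(counts[-2], counts[-1])
print(f"Aug 2000 to Jan 2005: {overall:.1%} per year")
print(f"Jan 2004 to Jan 2005: {last_year:.1%} per year")
```

Both intervals come out at roughly 15 to 17 percent per year, which is the kind of sustained growth that makes manual handling of the data impractical.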
11The explosive growth in biological information has created a need for computerized databases to store, organize, and index the data, and for specialized tools to view and analyze the data.
12The only way to handle such large quantities of data is a shotgun marriage between biology and computational techniques. The goal is to reveal new insights and principles in biology.
13The aims of bioinformatics are three-fold.
The first: to organise data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g. Swiss-Prot, or the Protein Data Bank for 3D macromolecular structures.
14The second: to develop tools, algorithms and statistics that aid in the analysis of data.
15The third: to use these tools to analyse the data and interpret the results in a biologically meaningful manner.
What does the analysis of data involve? For example, finding relationships among members of large data sets, or having programs that compare a sequence with previously characterised sequences.
16This requires more than just a straightforward database search. Development of such resources requires extensive knowledge of computational theory, as well as a thorough understanding of biology. This is why the new science is called Bioinformatics.
17Traditionally, biological studies examined
individual systems in detail, and frequently
compared them with a few that are related.
In bioinformatics, we can also conduct global
analyses of all the available data with the aim
of uncovering common principles that apply across
many systems and highlight features that are
unique to some.
18The questionnaire will help me to devise a robust core bioinformatics curriculum for the School of Computer Science and IT at the University of Nottingham. From the following list, check all that apply to you:
- I am a bioinformatician.
- I am a research scientist with skills in bioinformatics.
- I am a biologist turned bioinformatician.
- I am a computer scientist turned bioinformatician.
- I research in bioinformatics.
- I teach bioinformatics.
- I use bioinformatics in my work activities.
19Question 4. Indicate your agreement with the following statement: "There are two types of bioinformaticians: tool builders and tool users."
Agree / Agree somewhat / Disagree
a. Do these tool builders and tool users need to be trained separately?
Yes / No / Unsure
20Question 6. Which of these biological and computational concepts (if any) are essential to a bioinformatician's background?
- Foundations of molecular biology
- Systems biology
- Basic evolutionary theory
- Basic laboratory techniques (e.g. sequencing, DNA arrays, etc.)
- Computational strategies for inferring protein functions, determination of gene families, etc.
- Methods for molecular structure analysis (e.g. structure prediction, molecular dynamics, modeling, comparison, and so on)
- Sequence comparison
- Phylogenetic reconstruction
- Combinatorial approaches to sequencing
- RNA secondary and tertiary structure prediction
- Sequence feature extraction and annotation
- SNP detection and utilization
- Gene expression analysis
- Regulatory network modeling
21- Discrete math, linear algebra
- Advanced statistics
- Applied probabilities
- Empirical problem solving
- Modeling (e.g. hidden Markov processes, neural networks, cluster analysis, etc.)
- Combinatorial optimisation methods and algorithms (e.g. dynamic programming)
- Computing techniques (e.g. neural networks, fuzzy sets and systems, evolutionary computation, etc.)
- Probabilistic machine learning
- Design and analysis of algorithms and data structures
- Databases
- Data mining and knowledge discovery methodologies
- Programming languages (e.g. C, C++, C#, Java, Perl, etc.)
- Security issues
- Web design
- Software engineering
- Networking
- Distributed/parallel computing
22Bioinformatics journal, current issue of January 15, 2005: http://bioinformatics.oupjournals.org/
23Positions currently advertised.
24In this course, we focus on the aims of bioinformatics with particular reference to the keywords:
- sequence information
- sequence organisation
- sequence understanding
- large-scale analysis
- practical applications
25 What is a biological database?
A biological database is a large collection of data, associated with computerized software to retrieve, update and query components of the database. It gives researchers:
- easy access to the information
- a method for extracting only the information needed to answer a specific biological question.
26INFORMATION associated with molecules. Sources of data used in bioinformatics
Data source: Raw DNA sequence
Data size: GenBank Release 133: 28.5 billion bases, 22.3 million sequences (84 gigabytes)
Bioinformatics topics:
- Separating coding and non-coding regions
- Identification of introns and exons
- Gene product prediction
- Forensic analysis
GenBank is the collection of all publicly available DNA sequences.
27Data source: Protein sequence
Data size: Swiss-Prot 167,089 entries; TrEMBL 1,560,235 entries; total 1,727,324
Bioinformatics topics:
- Sequence comparison algorithms
- Multiple sequence alignment algorithms
- Identification of conserved sequence motifs
28Data source: Macromolecular structure
Data size: PDB 29,101 structures (last update 11-Jan-2005)
Bioinformatics topics:
- Secondary and tertiary structure prediction
- 3D structural alignment algorithms
- Protein geometry measurements
- Surface and volume shape calculations
- Intermolecular interactions
- Molecular simulations (force-field calculations, molecular movements, docking predictions)
29Data source: Genomes
Data size: 100 complete genomes (1.6 million to 3 billion bases each)
Bioinformatics topics:
- Structural assignments to genes
- Characterisation of repeats
- Characterisation of protein content
- Metabolic pathways
- Phylogenetic analysis
- Linkage analysis relating specific genes to diseases
- Mapping expression data to sequence, structural and biochemical data
30ORGANISE the information on a LARGE SCALE
A main concept in bioinformatics is that data can be grouped together on the basis of biologically meaningful similarities and a reasonably smart classification.
Similar sequences?
Protein family: pairwise residue identities between the proteins are 30% and greater. Clear evolutionary relationship.
Similar structures?
Protein fold: the same major secondary structures in the same arrangement and with the same topological connections. Major structural similarity.
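The 30% identity rule of thumb for protein families can be illustrated with a small calculation on two sequences that are already aligned. This is a minimal sketch: the function names and the toy sequences are made up for illustration, and skipping gapped columns is only one of several conventions for the denominator.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are skipped, so the
    denominator is the number of ungapped columns (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def same_family(a: str, b: str, threshold: float = 30.0) -> bool:
    """Apply the slide's rule of thumb: identity of 30% or greater."""
    return percent_identity(a, b) >= threshold


# Toy aligned fragments, invented for this example (not real proteins).
seq1 = "MKT-AYIAKQR"
seq2 = "MKSLAYLAKHR"
print(percent_identity(seq1, seq2), same_family(seq1, seq2))
```

In practice the alignment itself has to be computed first, and the identity threshold is a guideline rather than a strict definition.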
31Different types of protein sequence databases
1) Primary databases - a repository for the raw data.
- SWISS-PROT and PIR databases annotate the sequences as well as describe the proteins' functions and domain structure.
The pivotal role of annotation: only 15% of the SWISS-PROT protein-sequence databank is actually sequence; the rest comprises database and literature cross-references, biological descriptions and other explanatory notes.
In the absence of truly reliable robots for annotation, bioinformatics is still highly dependent on people who are able not only to use the standard tools and databases, but also to apply sufficient biological expertise to the analysis of the results. Bioinformatics specialists will be required for many years to come.
"Bioinformatics goes back to the future", Crispin J. Miller and Teresa K. Attwood, Nature Reviews Molecular Cell Biology, Volume 4, Feb. 2003
32Nucleotide and Genome Database
GenBank is the collection of all publicly available DNA sequences. Three organizations exchange data on a daily basis:
- GenBank at the National Center for Biotechnology Information
- DNA DataBank of Japan (DDBJ)
- European Molecular Biology Laboratory (EMBL)
A new release is made every two months.
33Different types of protein sequence databases
2) Composite databases - compile and filter sequence data from different primary databases.
OWL is a non-redundant composite of 4 publicly available primary sources: SWISS-PROT, PIR, GenBank (translation) and NRL-3D. The strict redundancy criteria render OWL relatively "small" and hence efficient in similarity searches.
Kabat database
OMIM, Online Mendelian Inheritance in Man: a catalog of human genes and genetic disorders.
34Protein sequence databases
3) Secondary databases contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family.
PROSITE, PRINTS: databases of short sequence patterns and conserved motifs that characterise biologically significant sites in proteins and a protein family.
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein families.
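To make the idea of a short sequence pattern concrete, the sketch below translates a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It is a simplified illustration, not the PROSITE software: the function name is hypothetical, only the basic syntax elements are handled (`x`, `[...]`, `{...}`, repeat counts, `<` and `>` anchors), and the test sequences are invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                     # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Pattern in PROSITE syntax for an N-glycosylation-type site.
glyco = prosite_to_regex("N-{P}-[ST]-{P}")
print(glyco)

# Invented toy sequences: the first contains the motif, the second does not.
print(bool(re.search(glyco, "AANGSAA")), bool(re.search(glyco, "AANPSAA")))
```

Patterns of this kind are fast to search but rigid; profile and hidden Markov model approaches such as those in Pfam trade simplicity for sensitivity.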
35Structural databases
The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology.
The PDB is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.
Current holdings: 29,101 structures (last update 11-Jan-2005).
In 2004, 5,356 structures were deposited to the PDB, a 14.5% increase over 2003's 4,677 depositions.
36PDB Content Growth
37Three major databases classify proteins by structure in order to identify structural and evolutionary relationships: CATH, SCOP and FSSP.
- Root: scop
- Classes
- All alpha proteins (179)
- All beta proteins (126)
- Alpha and beta proteins (a/b) (121): mainly parallel beta sheets (beta-alpha-beta units)
- Alpha and beta proteins (a+b) (234): mainly antiparallel beta sheets (segregated alpha and beta regions)
. . .
38Superposition of the binding sites of a series of
dihydrofolate reductase complexes. This
functionality enables the user to switch on and
off individual protein structures and their
ligands and water molecules.
39Timeline: the history before the Genome Project
40 1869 - Johann Miescher discovered DNA and named it nuclein, because it was isolated from the nucleus (central core) of cells.
411909 - The word "gene" coined
Danish botanist Wilhelm Johannsen coined the word "gene" to describe the Mendelian units of heredity. The word derives from the Greek genos, meaning "birth".
42Alfred Hershey and Martha Chase showed that only the DNA of a virus needs to enter a bacterium to infect it.
Electron microscope images showed that a bacterial virus (bacteriophage T2) attaches to a bacterium to infect it.
Hershey and Chase reasoned that the virus transferred genetic material into the bacterium to direct the production of more virus.
They knew that bacteriophage T2 was made of protein and DNA.
43Genes are made of ? Your suggestion?
44Genes are made of ?
Hershey and Chase knew that proteins contain sulfur atoms but no phosphorus.
[Figure: structures of the two sulfur-containing amino acids, cysteine (side chain CH2-SH) and methionine (side chain CH2-CH2-S-CH3)]
45Genes are made of ?
while DNA contains a great deal of phosphorus
and no sulfur.
46Hershey and Chase used radioactive sulfur and phosphorus to label, and so distinguish, viral proteins from viral DNA. After allowing labeled viruses to infect bacteria, they observed that the radioactive phosphorus entered the bacteria while the radioactive sulfur always remained outside.
Their experiment provided strong support for the
idea that genes are made of DNA.
47What was known about DNA at that time?
- 1) DNA is made of nucleotides, each of which is in turn made of three parts:
- a phosphate group that is linked to
- a deoxyribose sugar, which is in turn linked to
- one of four nitrogenous bases: adenine (A), cytosine (C), guanine (G), or thymine (T).
482) Nucleotides are linked into a chain
The 3'-hydroxyl group on the sugar unit reacts with the 5'-phosphate group on its neighbor to form a chain.
493) Erwin Chargaff's rule: in 1949 he discovered that in the DNA of any given type of cell the amount of adenine approximately equals the amount of thymine (A = T), while the amount of cytosine approximately equals the amount of guanine (C = G).
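Chargaff's ratios are easy to check by counting bases. The sketch below is an illustration only: the function name is hypothetical and the two strands are a made-up toy duplex, pooled together the way a chemical measurement of total base composition would pool them. It assumes the sample contains at least one T and one C.

```python
from collections import Counter


def base_ratios(dna: str) -> dict:
    """Return base counts plus the A/T and G/C ratios for a DNA sample."""
    counts = Counter(dna.upper())
    return {
        "counts": {b: counts[b] for b in "ACGT"},
        "A/T": counts["A"] / counts["T"],
        "G/C": counts["G"] / counts["C"],
    }


# Made-up toy duplex; strand_2 is the partner strand of strand_1.
strand_1 = "ATGCGGATTAGCAA"
strand_2 = "TTGCTAATCCGCAT"
result = base_ratios(strand_1 + strand_2)
print(result)
```

For real genomic DNA the ratios are close to, rather than exactly, 1, which is why the rule is stated as an approximate equality.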
50- 4) X-ray diffraction patterns, obtained by Rosalind Franklin and Maurice Wilkins, revealed great symmetry and consistency in the structure of DNA and gave important clues about its dimensions.
51James Watson and Francis Crick (April 25, 1953), Nature: "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid".
Photo of the first metal model of the double helix
52 Fundamental elements of DNA structure
A-T and G-C: the nucleotide bases use hydrogen bonds to pair specifically, with an A always opposing a T, and a C always opposing a G.
53Their one-page, 900-word paper, published in Nature, concluded, famously: "It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."
Watson, Crick, and Wilkins received the Nobel Prize for Physiology or Medicine in 1962.
James Watson wrote a personal account of his famous discovery and the people involved: James D. Watson, "The Double Helix: A Personal Account of the Discovery of the Structure of DNA". The book was originally published in 1968.
54 Molecular Biology/Structure of DNA
Note orientation of antiparallel strands
Watson-Crick base pairing
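Because the two strands pair A with T and G with C and run in opposite directions, either strand determines the other. A minimal sketch, using a made-up sequence and an illustrative function name:

```python
# Watson-Crick partners for upper- and lower-case bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own 5'->3' direction."""
    # Complement each base, then reverse to account for antiparallel strands.
    return strand.translate(COMPLEMENT)[::-1]


top = "ATGCGGATTAGCAA"  # made-up sequence, read 5'->3'
bottom = reverse_complement(top)
print(top)
print(bottom)
```

Applying the function twice returns the original strand, which mirrors the copying mechanism suggested by the specific pairing.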
551961 - mRNA ferries information
Brenner, Jacob, and Meselson discovered that mRNA is the molecule that takes information from DNA in the nucleus to the protein-making machinery in the cytoplasm.
The three predominant forms of RNA are all involved in translating the genetic information from the sequence of bases in DNA to a sequence of amino acids in proteins: messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA).
56RNA is chemically similar to DNA except that
- the sugar in its nucleotide building blocks is
ribose and not deoxyribose.
- RNA uses the nucleotide base uracil instead of
thymine.
- RNA, especially mRNA, tends to be
single-stranded, not double-stranded like DNA.
But like thymine, uracil can pair with adenine.
57 Overview of Transcription and Translation
Transcription - production of RNA from DNA
Translation - production of protein from RNA
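The two steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: the input is the coding (sense) strand, so the mRNA has the same sequence with U in place of T; the standard genetic code is used; translation starts at the first base and stops at the first stop codon. The function names and the example sequence are made up.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G;
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base up to the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("ATGTGGAAATGA")  # made-up coding strand
protein = translate(mrna)
print(mrna, protein)
```

Real genes add complications that this sketch ignores, such as locating the start codon, introns, and alternative genetic codes.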
58 tRNAs Link mRNA with Amino Acids
The tRNA pictured is the specific, or cognate, tRNA for W (tryptophan).
The first base of the anticodon pairs with the third base of the codon.
59 The Universal Genetic Code
- The code is about 50% GC and 50% AT(U) in the first 2 positions of codons.
- The last codon position is least specific for coding an aa (wobble): there are only 3 codons, AUG, UGG, and UGA, in which a unique meaning is conferred by a particular base at the third position.
- Note the unequal number of codons for different aa: only a weak correlation between number of codons and frequency of aa use.
- Codons coding for the same aa are called synonymous codons.
- Mutations...
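The statements about synonymous codons and the third position can be checked directly against the standard codon table. A minimal sketch, assuming the standard genetic code with '*' for stop; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group synonymous codons by the amino acid (or stop) they encode.
synonymous = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonymous[aa].append(codon)

# Codons whose meaning changes with ANY change of the third base.
third_base_unique = sorted(
    codon for codon, aa in CODON_TABLE.items()
    if all(CODON_TABLE[codon[:2] + b] != aa for b in BASES if b != codon[2])
)

# Fraction of G or C among the first two positions, over all 64 codons.
gc_first_two = sum(c[:2].count("G") + c[:2].count("C")
                   for c in CODON_TABLE) / (2 * len(CODON_TABLE))

for aa in sorted(synonymous, key=lambda k: len(synonymous[k])):
    print(aa, len(synonymous[aa]), " ".join(synonymous[aa]))
print(third_base_unique, gc_first_two)
```

The counts per amino acid range from one codon (M, W) to six (L, S, R), which is the unequal distribution the slide refers to.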
60 General Structure of an Amino Acid
?
61Home assignment
Side chains ?
Characteristics of amino
acids ?
Possible classification of amino acids ?
62 Structure of a peptide bond
Note orientation of polypeptide
63 Main Chain Torsion Angles
Omega (ω)
64 Levels of Protein Structure
Core elements
65Human Genome Project
66 1990 - Launch of the Human Genome Project
With the official launch of the Human Genome Project, the National Institutes of Health and the Department of Energy established goals for the first five years of the project.
The goals included:
1) mapping the human genome and eventually determining the sequence of all 3.2 billion letters in it
2) mapping and sequencing the genomes of other organisms important to the study of biology
3) developing technology for analyzing DNA
4) studying the ethical, legal, and social implications of this research.
671994 - Detailed human genetic map
September - one year ahead of schedule
Genetic maps (also called linkage maps) show only approximate and relative distances between genes, based on genetic markers. Such a map indicates that a gene lies in a particular region.
68A genetic map helps us with one of the primary goals of the HGP: finding disease genes.
1) On which chromosome does a gene lie?
2) Approximately where on that chromosome?
3) How do we find a disease gene?
The main idea is that if a particular genetic marker is inherited together with a disease gene, the gene likely resides near the genetic marker.
The map had more markers than originally proposed: nearly 6,000.
691998 - HGP map includes 30,000 human genes
HGP researchers released a gene map that included
30,000 human genes, estimated to represent
approximately one-third of the total human genes.
70The next step is to find the exact location of a single GENE within the 3 billion base pairs of DNA that make up the human genome. This requires PHYSICAL MAPS, which show the actual distance in base pairs between different positions in a molecule.
711995 - Two microbial genomes sequenced
July: the first complete genome of a bacterium, Haemophilus influenzae: 1,830,137 base pairs (a bit over 0.05% of the size of the human genome). The sequence revealed the complete instruction book of a free-living organism for the first time.
Haemophilus influenzae causes respiratory and other infections; despite its name, it is not the cause of influenza.
721995 - Two microbial genomes sequenced
October: Mycoplasma genitalium, the smallest known genome of a free-living organism: 580,070 base pairs of DNA. Its 470 predicted genes apparently represent a basic set of genes necessary for independent existence.
731997 - E. coli genome sequenced
The E. coli genome consists of about 4,600,000
base pairs and contains approximately 4,000 genes.
The strain of E. coli used for the sequencing
project is not a pathogen (that is, it does not
cause disease). Comparing the normal strain with pathogenic strains is expected to help suggest treatments for the illnesses they cause and strategies to prevent infection.
742000 - Working draft
February 12, 2001, WASHINGTON, D.C. - The Human Genome Project international consortium today announced the publication of a draft sequence and initial analysis of the human genome, the genetic blueprint for a human being.
The draft sequence covers more than 90 percent of the human genome.
This DNA text influences everything from eye color and height to aging and disease.
75The highlights of the text for the "Book of Life."
- The Billion-Dollar Question: How Many Genes Are There?
- Scientists now estimate that humans have some 30,000 to 35,000 genes in their genomes.
- This new estimate indicates that humans have only about twice as many genes as the worm or the fly.
- Question:
- How can human complexity be explained by a genome with such a paucity of genes?
It turns out humans are able to do more with what they have than other species. Instead of producing only one protein per gene, human genes can produce several different proteins.
76The HGP (1990-2005) has a cost of $3 billion. This includes:
- genome sequencing of human, bacteria, yeast, worms, flies and mice
- studies of human diseases
- development of new technologies for biological and medical research
- computational methods to analyze genomes
- ethical, legal and social issues related to genetics.
Human genome sequencing itself is approximately $300 million.
771990 - ELSI founded
Ethical, Legal and Social Implications (ELSI) programs were founded at NIH and DOE. The information gained from mapping and sequencing the human genome would have the potential to dramatically improve human health. It would also raise a number of complex ethical, legal and social issues:
- How should the newly accessible genetic information be interpreted and used?
- Who should have access to it?
- How can people be protected from the harm that might result from its improper disclosure or use?
781995 - Ban on genetic discrimination in the U.S.
The Americans with Disabilities Act (ADA) would protect individuals subjected to discrimination on the "basis of genetic information relating to illness, disease or other disorders".
An example: a person with a genetic test showing a predisposition for colon cancer would be protected under the ADA from discrimination against him or her because of that perception.
79- Home assignment
- The new science of bioinformatics: which topics in this science are the most interesting to me? (small essay, 1-2 pages)
- Which database do I usually use for my research or analysis? What is the goal of my research? Why did I select this database? How does one work with the database? (oral presentation)
- Create your own database of amino acid properties. What principles do you use to create the database? What properties do you classify? What does your database look like? (small project; be ready for an oral presentation)
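As a possible starting point for the amino acid database project, the sketch below builds a tiny in-memory table with Python's built-in sqlite3 module and runs one query. It is one of many reasonable designs, not a required solution: the table name, the columns, and the coarse side-chain classes are assumptions chosen for illustration, and only a handful of residues are entered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database held in memory
conn.execute("""
    CREATE TABLE amino_acid (
        one_letter   TEXT PRIMARY KEY,
        three_letter TEXT NOT NULL,
        name         TEXT NOT NULL,
        side_chain   TEXT NOT NULL,    -- coarse class; one possible scheme
        has_sulfur   INTEGER NOT NULL  -- 1 if the side chain contains sulfur
    )
""")

# A few residues only; the class labels are a simplification and other
# classifications are equally defensible.
rows = [
    ("G", "Gly", "glycine",    "nonpolar", 0),
    ("A", "Ala", "alanine",    "nonpolar", 0),
    ("C", "Cys", "cysteine",   "polar",    1),
    ("M", "Met", "methionine", "nonpolar", 1),
    ("S", "Ser", "serine",     "polar",    0),
    ("D", "Asp", "aspartate",  "acidic",   0),
    ("K", "Lys", "lysine",     "basic",    0),
    ("W", "Trp", "tryptophan", "aromatic", 0),
]
conn.executemany("INSERT INTO amino_acid VALUES (?, ?, ?, ?, ?)", rows)

# Query: which residues contain sulfur?
sulfur = [r[0] for r in conn.execute(
    "SELECT one_letter FROM amino_acid WHERE has_sulfur = 1 "
    "ORDER BY one_letter")]

# Query: how many residues fall into each side-chain class?
per_class = dict(conn.execute(
    "SELECT side_chain, COUNT(*) FROM amino_acid GROUP BY side_chain"))

print(sulfur, per_class)
```

The design questions in the assignment map onto the schema: which properties become columns, which become separate tables, and which classification scheme the class column follows.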