Title: An Introduction to Bioinformatics. CSE, Marmara University mimoza.marmara.edu.tr/~m.sakalli/cse546 Oct/12/09 Source http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt
2 Terminology
- Bioinformatics: using computational techniques to access, analyze, and interpret biological information, including tool building. Biocomputing and computational biology are used as synonyms.
- Sequence analysis is the study of molecular sequence data.
- Genomics analyzes the context of genes or complete genomes.
- Proteomics is the subdivision of genomics concerned with analyzing the protein complement, i.e. the proteome.
- The Human Genome Project and numerous other genome projects keep the data coming at alarming rates.
- Homo sapiens: 3.2 billion base pairs. Estimates of the number of genes were in the 100,000 range, but the number turns out to be only about twice as many as a fruit fly, between 25,000 and 35,000!
- The protein-coding region of the genome is only about 1% or so; a large part of the remainder is "jumping" selfish DNA, much of which may be involved in regulation and control.
3
- Three major databases, each with its own specific format. They are mirrored among each other and share accession codes, but NOT identifier names.
- 1) National Center for Biotechnology Information (NCBI), at the National Library of Medicine (NLM), at the NIH (GenBank and GenPept).
  - http://www.ncbi.nlm.nih.gov/
  - http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
- Georgetown University's National Biomedical Research Foundation Protein Identification Resource (PIR), and the Naval Research Lab sequences of three-dimensional structure.
  - http://www-nbrf.georgetown.edu/
  - http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html
- 2) European Molecular Biology Laboratory (EMBL)
  - http://www.ebi.ac.uk/embl/index.html, http://www.embl-heidelberg.de/
- European Bioinformatics Institute (EBI)
  - http://www.ebi.ac.uk/
- Swiss Institute of Bioinformatics (SIB), Expert Protein Analysis System (ExPASy)
  - http://www.expasy.ch/, http://www.expasy.org/links.html
- Nucleotide Sequence Database, amino acid sequence databases
  - http://expasy.cbr.nrc.ca/sprot/
- 3) DNA Data Bank of Japan (DDBJ)
  - http://www.ddbj.nig.ac.jp/
4
- Atlas of Protein Sequence and Structure: the first well-recognized protein sequence database, from the mid-sixties, by Dr. Margaret Dayhoff.
- DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences.
- Each program needs to recognize particular aspects of the sequence files, so making a program flexible enough to handle them is a headache. NCBI's ASN.1 format and its Entrez interface attempt to reduce these problems.
- Unfortunately, there is nothing here like the IEEE working groups or the Internet task force RFCs, for example, so format issues are the most confusing and troubling aspect of working with primary sequence data.
- Sequence database installations are commonly a complex ASCII/binary mix, neither relational nor object-oriented, and often proprietary.
- They contain several very long text files, each holding a different type of information, all related to particular sequences.
- Software is usually required to interact with these databases. One example is ReadSeq by Don Gilbert, a reformatting program for DNA and protein sequences that accepts single or multiple inputs in 18 different formats and converts them to a specified format.
5
- http://www.molecularevolution.org/
- AWTY (Are We There Yet?) is a system for graphically exploring convergence of Markov Chain Monte Carlo (MCMC) chains in Bayesian phylogenetic inference (Nylander et al. 2008).
- FigTree is for graphically viewing phylogenetic trees.
- Clustal W (Thompson et al. 1994) is for global multiple sequence alignment. It uses a progressive alignment algorithm with affine gap penalties and a guide tree based on sequence similarity to align DNA or amino acid sequences. The affine gap cost model penalizes insertions and deletions using a linear function in which one term is length independent and the other is length dependent: Gap penalty = Gapopen + Len × Gapextend. Reviews comparing multiple alignment algorithms include Hickson et al. 2000, Thompson et al. 1999, and McClure et al. 1994. Morrison and Ellis (1997) discuss the effects of nucleotide sequence alignment on the estimation of phylogenetic hypotheses. The current version is Clustal W2 (Larkin et al. 2007). The program is also available with a graphical user interface, Clustal X.
- BEAST (with BEAUti), Bayesian Evolutionary Analysis Sampling Trees, is for evolutionary inference from molecular sequences, by Andrew Rambaut and Alexei Drummond (Drummond et al. 2002, 2005, 2006).
- FASTA compares pairs of protein or DNA sequences, and also compares a single protein or DNA sequence to a database or library. It is fast and is available as a local or remote service.
- GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs phylogenetic searches on aligned nucleotide datasets using the maximum likelihood criterion.
- MAFFT uses the fast Fourier transform (FFT) to optimize protein alignments based on physical properties of the amino acids (Katoh et al. 2002, 2005). The program uses progressive alignment followed by refinement, also known as iterative alignment.
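The affine gap cost described for Clustal W on this slide can be written out as a small function. This is only a sketch of the two-term formula (one length-independent term, one length-dependent term); the function name and the example values for Gapopen and Gapextend are hypothetical and are not claimed to be Clustal W's defaults. Some programs charge the extension term for Len - 1 residues instead of Len, so check the documentation of the tool you use.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Cost of a single gap: gap_open + length * gap_extend.

    gap_open   -- length-independent term, charged once per gap
    gap_extend -- length-dependent term, charged per gapped position
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + length * gap_extend


# Hypothetical parameter values, chosen only for illustration.
one_long_gap = affine_gap_penalty(2, gap_open=10.0, gap_extend=0.5)
two_short_gaps = 2 * affine_gap_penalty(1, gap_open=10.0, gap_extend=0.5)
print(one_long_gap, two_short_gaps)
```

With these values one gap of length 2 is cheaper than two separate gaps of length 1, which is the point of the affine model: opening a gap is expensive, extending it is not.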
6
- All sequence databases contain (in their own format):
- Name (genetic identifiers): LOCUS, ENTRY, ID.
- Definition: a brief, one-line, textual sequence description.
- Accession number: a constant data identifier.
- Source and classification (taxonomy) information.
- Complete literature references.
- Comments and keywords.
- The all-important FEATURE table!
- A summary or checksum line.
- The sequence itself.
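Because each database marks the name field with its own keyword (LOCUS, ENTRY, ID), the first line of a file is often enough to guess which format it is in. The sketch below is a hypothetical helper, not part of any of the tools named in these slides; the `>` test for FastA and the `!!NA_SEQUENCE` header for GCG nucleotide files are assumptions added for completeness.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line."""
    keyword_to_format = {
        "LOCUS": "genbank",
        "ID": "embl/swiss-prot",
        "ENTRY": "pir",
    }
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(("!!AA_SEQUENCE", "!!NA_SEQUENCE")):
            return "gcg"
        if stripped.startswith(">"):
            return "fasta"
        first_word = stripped.split()[0]
        return keyword_to_format.get(first_word, "unknown")
    return "unknown"


print(guess_format("LOCUS       HSEF1AR      1506 bp    mRNA"))
print(guess_format("ID   EF11_HUMAN     STANDARD;      PRT;   462 AA."))
```

A real converter such as ReadSeq has to look well past the first line; this only shows why the leading keyword is a useful first check.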
7
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
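The header of a GenBank flat file like the one on this slide can be read with very little code, because the keyword sits in the first 12 columns and the value follows. The sketch below assumes that column layout and stops at the FEATURES table; the function name is hypothetical, and the sample is a shortened copy of the slide's record. For real work an established parser is the safer choice.

```python
SAMPLE_RECORD = """\
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1
            alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
SOURCE      human.
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
     source          1..1506
"""


def parse_genbank_header(text):
    """Map each header keyword to a list of values (one per occurrence)."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            break  # the feature table needs its own parser
        keyword = line[:12].strip()   # columns 1-12 hold the keyword
        value = line[12:].strip()     # the rest of the line is the value
        if keyword:
            current = keyword
            fields.setdefault(keyword, []).append(value)
        elif current is not None and value:
            # A blank keyword column means the previous value continues.
            fields[current][-1] = (fields[current][-1] + " " + value).strip()
    return fields


header = parse_genbank_header(SAMPLE_RECORD)
print(header["ACCESSION"], header["ORGANISM"])
```

Values are kept in lists because keywords such as REFERENCE can occur more than once in a record.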
8 EMBL and SWISS-PROT
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation ... -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).
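EMBL and SWISS-PROT entries put a two-letter line code at the start of every line, so grouping lines by that code is a natural first step. The sketch below assumes the usual layout (code in columns 1-2, data from column 6) and uses a shortened copy of the slide's entry; the function names are hypothetical.

```python
from collections import defaultdict

SAMPLE_ENTRY = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OX   NCBI_TaxID=9606, 9913, 9986;
"""


def parse_line_codes(text):
    """Group the data part of each line under its two-letter line code."""
    record = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if len(line) < 2 or not line[:2].strip():
            continue  # skip blank or code-less lines
        record[line[:2]].append(line[5:].strip())
    return dict(record)


def accessions(record):
    """Split the AC line(s) into individual accession numbers."""
    joined = " ".join(record.get("AC", []))
    return [item.strip() for item in joined.split(";") if item.strip()]


entry = parse_line_codes(SAMPLE_ENTRY)
print(accessions(entry))
print(" ".join(entry["OS"]))
```

Repeated codes (here the three OS lines) stay in order, so joining them restores the original sentence.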
9 PIR/NBRF format
ENTRY            EFHU1 type complete
TITLE            translation elongation factor eEF-1 alpha-1 chain - human
ALTERNATE_NAMES  translation elongation factor Tu
ORGANISM         formal_name Homo sapiens common_name man
   cross-references taxon:9606
DATE             30-Jun-1988 sequence_revision 05-Apr-1995 text_change ..
ACCESSIONS       B24977; A25409; A29946; A32863; I37339
REFERENCE        A93610
   authors       Rao, T.R.; Slobin, L.I.
   journal       Nucleic Acids Res. (1986) 14:2409
   title         Structure of the amino-terminal end of mammalian elongation
   accession     B24977
      molecule_type mRNA
      residues      1-82,'A',84-94 label RAO
      cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1; PID:g31110.
GENETICS
   gene          GDB:EEF1A1; EEF1A; EF1A
      cross-references GDB:118791; OMIM:130590
10 Examples of DBs with specialized types of sequences
- Almost all the links: Human Genome Ensembl Project at http://www.ensembl.org/
- Patterns, motifs, and profiles: REBASE, EPD, PROSITE.
- Aligned multiple sequence entries: RDP and ALN.
- Functionally, structurally, or phylogenetically ordered: iProClass and the HOVERGEN vertebrate gene database.
- HIV Database, and the Giardia lamblia Genome Project.
- 3D structure: atomic coordinate data is necessary to define the tertiary shape of a particular biological molecule. Protein Data Bank and Rutgers Nucleic Acid Database.
- MolBio: molecular visualization with special software.
- Genomic linkage mapping databases for H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli.
- OMIM: Online Mendelian Inheritance in Man.
- Phylogenetic tree databases, e.g. the Tree of Life.
11
- There's a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner.
- For example: given a novel genome sequence, find all genes and p-genes.
- I want to design "sequence capture" probes for the exons of 40 genes that cause RP.
- Obtain the exonic sequence, with at least 100 nt of flanking sequence, and 1000 nt of the promoter upstream of the transcription start.
- I propose a new way to find disease-causing mutations in humans. I want to look only in genes that have regions that 1) are highly conserved across species, 2) have known functional protein domains (e.g. transmembrane domains), and 3) have mRNA secondary structure. Is this a good idea?
- 1859: Charles Darwin's The Origin of Species
- Basic Mendelian genetics
  - Mendel's laws
    - independent assortment
    - independent segregation
  - mitosis and meiosis
  - dominant/recessive and pedigrees (the graphs of phenotype)
  - alleles
- Basic molecular genetics
  - DNA
  - RNA
12 Pearson FastA format; GCG single sequence format
Pearson FastA format:
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK
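The FastA layout above (a `>` header line followed by sequence lines) is simple enough to read by hand. The following is a minimal sketch, not the FASTA package's own reader; the function name is hypothetical, and the sample holds only the first two sequence lines of the slide's record (residues 1-100) to keep it short.

```python
SAMPLE_FASTA = """\
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
"""


def read_fasta(text):
    """Return a list of (header, sequence) pairs from FastA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for name, sequence in read_fasta(SAMPLE_FASTA):
    print(name.split()[0], len(sequence))
```

The first word of the header is conventionally the identifier; everything after it is free-text description.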
GCG single sequence format:
!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding status predicted <EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) status predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) status predicted
EFHU1  Length: 462  January 14, 2002 19:49  Type: P  Check: 5308  ..
  1  MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE
351  GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401  IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451  VTKSAQKAQK AK
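The `Check:` value in the GCG header line is a checksum over the sequence. The sketch below uses the weighting scheme commonly described for GCG checksums (position weights cycling from 1 to 57, sum taken modulo 10000); treat that scheme as an assumption rather than a statement of GCG's specification, and note that it has not been checked against the `Check: 5308` value on the slide. The function name is hypothetical.

```python
def gcg_style_checksum(sequence):
    """Position-weighted checksum in the style commonly attributed to GCG.

    Each residue's character code is multiplied by a weight that cycles
    1, 2, ..., 57, 1, 2, ... and the total is reduced modulo 10000.
    Whitespace is ignored and case does not matter.
    """
    residues = "".join(sequence.split()).upper()
    total = 0
    for index, residue in enumerate(residues):
        weight = (index % 57) + 1
        total += weight * ord(residue)
    return total % 10000


# Spacing and case do not change the result.
print(gcg_style_checksum("MGKEKTHINI VVIGHVDSGK") == gcg_style_checksum("mgkekthinivvighvdsgk"))
```

Because the weights depend on position, swapping two different residues changes the checksum, which a plain sum of character codes would not detect.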