An Introduction to Bioinformatics. CSE, Marmara University mimoza.marmara.edu.tr/~m.sakalli/cse546 Oct/12/09 Source http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt

1
An Introduction to Bioinformatics. CSE, Marmara University
mimoza.marmara.edu.tr/~m.sakalli/cse546 Oct/12/09
Source http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt
2
Terminology

Bioinformatics: using computational techniques to
access, analyze, and interpret biological
information. Tool building. Biocomputing and
computational biology are synonyms.
Sequence analysis is the study of molecular
sequence data.
Genomics analyzes the context of genes or
complete genomes.
Proteomics is the subdivision of genomics
concerned with analyzing the protein complement,
i.e. the proteome.
The Human Genome Project and numerous other genome
projects keep the data coming at alarming rates.
Homo sapiens: about 3.2 billion base pairs.
Early estimates of the number of genes were in the
100,000 range, but the count turns out to be only
about twice that of a fruit fly, between 25,000 and 35,000!
The protein-coding region of the genome is only
about 1% or so; much of the remainder is
"jumping", "selfish" DNA, of which much may be
involved in regulation and control.

Three major databases, each with its own specific
format. They are mirrored among each other and share
accession codes, but NOT identifier names.
1) National Center for Biotechnology Information
(NCBI), at the National Library of Medicine (NLM),
at the NIH (GenBank and GenPept).
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Georgetown University's National Biomedical
Research Foundation Protein Identification
Resource (PIR) and the Naval Research Lab sequences of
three-dimensional structure.
http://www-nbrf.georgetown.edu/
http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html
2) European Molecular Biology Laboratory (EMBL)
http://www.ebi.ac.uk/embl/index.html,
http://www.embl-heidelberg.de/
European Bioinformatics Institute (EBI),
http://www.ebi.ac.uk/
Swiss Institute of Bioinformatics (SIB), Expert
Protein Analysis System (ExPASy)
http://www.expasy.ch/, http://www.expasy.org/links.html
Nucleotide Sequence Database, amino acid sequence
databases
http://expasy.cbr.nrc.ca/sprot/
3) DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/

Atlas of Protein Sequence and Structure: the
first well-recognized protein sequence database,
compiled in the mid-sixties by Dr. Margaret Dayhoff.
DDBJ began in 1984, GenBank in 1982, and EMBL in
1980. They are all attempts at establishing an
organized, reliable, comprehensive and openly
available library of genetic sequences.
Each program needs to recognize particular
aspects of the sequence files; the flexibility of the
program is a headache. NCBI's ASN.1 format and
its Entrez interface attempt to reduce these
problems.
Unfortunately, there is no counterpart here to the IEEE
working groups or to the Internet Engineering Task Force
with its RFCs, so format
issues are the most confusing and troubling
aspect of working with primary sequence data.
Sequence database installations are commonly a
complex ASCII/binary mix (often proprietary), neither
relational nor object-oriented.
They contain several very long text files, each
holding different types of information, all
related to particular sequences.
Software is usually required to interact with
these databases, e.g. ReadSeq by Don Gilbert, a
reformatting program for DNA and protein
sequences that accepts single or multiple inputs in
18 different formats and converts them to a specified
format.

http://www.molecularevolution.org/
AWTY (Are We There Yet?) is a system for
graphically exploring convergence of Markov chain
Monte Carlo (MCMC) chains in Bayesian
phylogenetic inference (Nylander et al. 2008).
FigTree graphically displays phylogenetic trees.
Clustal W (Thompson et al. 1994) performs global
multiple sequence alignment. It uses a progressive
alignment algorithm with affine gap penalties and
a guide tree based on sequence similarity to
align DNA or amino acid sequences. The affine gap
cost model penalizes insertions and deletions
using a linear function in which one term is
length independent and the other is length
dependent: Gap penalty = Gapopen + Len × Gapextend.
Several reviews compare multiple
alignment algorithms (e.g., Hickson et al. 2000,
Thompson et al. 1999, and McClure et al. 1994).
Morrison and Ellis (1997) discuss the effects of
nucleotide sequence alignment on the estimation
of phylogenetic hypotheses. The current version
is Clustal W2 (Larkin et al. 2007). The program
is also available with a graphical user
interface, Clustal X.
BEAST (with BEAUti), Bayesian Evolutionary Analysis
Sampling Trees, is for evolutionary inference from
molecular sequences, by Andrew Rambaut and Alexei
Drummond (Drummond et al. 2002, 2005, 2006).
FASTA compares pairs of protein or DNA sequences
and also compares a single protein or DNA
sequence to a database or library. It is fast and is
available as a local or remote service.
GARLI (Genetic Algorithm for Rapid Likelihood
Inference) performs phylogenetic searches on
aligned nucleotide datasets using the maximum
likelihood criterion.
MAFFT uses the fast Fourier transform (FFT) to optimize protein
alignments based on physical properties of the
amino acids (Katoh et al. 2002, 2005). The
program uses progressive alignment followed by
refinement, also known as iterative alignment.
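
The affine gap cost described for Clustal W above can be sketched in a few lines of Python. This is only an illustration of the formula on the slide (Gapopen plus Len times Gapextend); the function name and the default values 10.0 and 0.5 are assumptions chosen for the example, not Clustal W's actual settings or code.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of `length` residues: gap_open + length * gap_extend.

    gap_open is the length-independent term, gap_extend the
    length-dependent one. Default values are illustrative only.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + length * gap_extend


# One gap of five residues versus five separate one-residue gaps.
one_long_gap = affine_gap_penalty(5)
five_short_gaps = 5 * affine_gap_penalty(1)
print(one_long_gap, five_short_gaps)
```

Because the opening cost is paid once per gap, a single long gap is much cheaper than several short ones, which is why affine penalties tend to group insertions and deletions together.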

All sequence databases contain (in their own
format):
Name (genetic identifiers): LOCUS, ENTRY, ID.
Definition: a brief, one-line, textual sequence
description.
Accession Number: a constant data identifier.
Source and classification (taxonomy) information.
Complete literature references.
Comments and keywords.
The all-important FEATURE table!
A summary or checksum line.
The sequence itself.

LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
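
A record like the one above can be read with very little code, because GenBank flat files put the field keyword in the first 12 columns and the value after it. The following is a minimal sketch under that assumption; the function name `parse_genbank_header` and the abbreviated `ROWS` sample are made up for illustration, and a real project would more likely use an established parser.

```python
def parse_genbank_header(text):
    """Collect keyword fields from a GenBank-style record up to FEATURES.

    Assumes the keyword sits in columns 1-12 and that lines with a
    blank keyword column continue the previous field.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            break  # the feature table has its own layout
        key = line[:12].strip()
        value = line[12:].strip()
        if key:
            current = key
            fields[current] = value
        elif current is not None and value:
            fields[current] += " " + value
    return fields


# Abbreviated sample built from the record shown above (illustrative).
ROWS = [
    ("LOCUS", "HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993"),
    ("DEFINITION", "Human mRNA for elongation factor 1"),
    ("", "alpha subunit (EF-1 alpha)."),
    ("ACCESSION", "X03558"),
    ("SOURCE", "human."),
    ("  ORGANISM", "Homo sapiens"),
    ("FEATURES", "         Location/Qualifiers"),
]
RECORD = "\n".join(f"{key:<12}{value}" for key, value in ROWS)

header = parse_genbank_header(RECORD)
print(header["ACCESSION"], "|", header["DEFINITION"])
```

The accession number, not the LOCUS name, is the field to carry between databases, since only accession codes are shared.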

8
EMBL and SWISS-PROT

ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.
RT   "The primary structure of the alpha subunit of human elongation . -binding sites."
RL   Eur. J. Biochem. 155:167-171(1986).
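
EMBL and SWISS-PROT entries use a two-letter code at the start of every line, so the record above can be grouped by code with a short loop. This is a rough sketch, not a full parser: the function name `parse_line_codes` is hypothetical, the sample is abbreviated, and the semicolon separators in the AC line are assumed to follow the usual SWISS-PROT layout.

```python
from collections import defaultdict


def parse_line_codes(text):
    """Group the lines of an EMBL/SWISS-PROT style entry by line code.

    Repeated codes (e.g. several DE or OS lines) are joined with a space.
    """
    grouped = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and the entry terminator
        code, value = line[:2], line[2:].strip()
        grouped[code].append(value)
    return {code: " ".join(values) for code, values in grouped.items()}


# Abbreviated sample taken from the entry shown above (illustrative).
SAMPLE = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
"""

entry = parse_line_codes(SAMPLE)
entry_name = entry["ID"].split()[0]
accessions = [a.strip() for a in entry["AC"].split(";") if a.strip()]
print(entry_name, accessions)
```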

9
PIR/NBRF format

ENTRY            EFHU1 type complete
                 iProClass View of EFHU1
TITLE            translation elongation factor eEF-1 alpha-1 chain - human
ALTERNATE_NAMES  translation elongation factor Tu
ORGANISM         formal_name Homo sapiens
                 common_name man
                 cross-references taxon:9606
DATE             30-Jun-1988 sequence_revision 05-Apr-1995 text_change..
ACCESSIONS       B24977; A25409; A29946; A32863; I37339
REFERENCE        A93610
   authors       Rao, T.R.; Slobin, L.I.
   journal       Nucleic Acids Res. (1986) 14:2409
   title         Structure of the amino-terminal end of mammalian elongation
   accession     B24977
   molecule_type mRNA
   residues      1-82,'A',84-94 label RAO
   cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1; PID:g31110.
GENETICS
   gene          GDB:EEF1A1; EEF1A; EF1A
   cross-references GDB:118791; OMIM:130590
10
Examples of DBs with specialized type of sequences

Almost all the links: the Ensembl human genome
project at http://www.ensembl.org/
Patterns, motifs, and profiles: REBASE, EPD,
PROSITE.
Aligned multiple sequence entries: RDP and ALN.
Functionally, structurally, or phylogenetically
ordered: iProClass and the HOVERGEN vertebrate gene
database.
Organism-specific: the HIV Database and the Giardia lamblia Genome
Project.
3D structure: atomic coordinate data is necessary
to define the tertiary shape of a particular
biological molecule. Protein Data Bank and the Rutgers
Nucleic Acid Database.
MolBio: molecular visualization with special
software.
Genomic linkage mapping databases for H. sapiens,
Mus, Drosophila, C. elegans, Saccharomyces,
Arabidopsis, E. coli.
OMIM: Online Mendelian Inheritance in Man.
Phylogenetic tree databases, e.g. the Tree of
Life.

There's a bewildering assortment of different
databases and ways to access and manipulate the
information within them. The key is to learn how
to use that information in the most efficient
manner.
For example: Given a novel genome sequence, find
all genes and pseudogenes.
I want to design "sequence capture" probes for
the exons of 40 genes that cause RP.
Obtain the exonic sequence, with at least 100
nt of flanking sequence, and 1000 nt of the promoter upstream of the
transcription start.
I propose a new way to find disease-causing
mutations in humans. I want to look only in
genes that have regions that 1) are highly
conserved across species, 2) have known
functional protein domains (e.g. transmembrane
domains), and 3) have mRNA secondary structure.
Is this a good idea?
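
The sequence-capture task above comes down to interval arithmetic once exon coordinates and the transcription start are known. The sketch below shows one way to compute the padded exon windows and the promoter window. The `Interval` type, the function names, and the 1-based inclusive coordinate convention are all assumptions made for this example, and the coordinates in the usage lines are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def padded_exons(exons, flank=100):
    """Extend each exon by `flank` nt on both sides, clipped at position 1."""
    return [Interval(max(1, e.start - flank), e.end + flank) for e in exons]


def promoter(tss, strand, length=1000):
    """Return the `length` nt immediately upstream of the transcription start.

    Upstream means lower coordinates on the '+' strand and higher
    coordinates on the '-' strand.
    """
    if strand == "+":
        return Interval(max(1, tss - length), tss - 1)
    if strand == "-":
        return Interval(tss + 1, tss + length)
    raise ValueError("strand must be '+' or '-'")


# Invented coordinates, for illustration only.
windows = padded_exons([Interval(500, 600), Interval(50, 80)])
plus_promoter = promoter(5000, "+")
minus_promoter = promoter(5000, "-")
print(windows, plus_promoter, minus_promoter)
```

Padded windows from neighbouring exons can overlap, so a real probe design would probably merge them before fetching sequence.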
1859: Charles Darwin's The Origin of Species
Basic Mendelian genetics
Mendel's laws
independent assortment
independent segregation
mitosis and meiosis
dominant/recessive and pedigrees (the graphs of
phenotype)
alleles
Basic molecular genetics
DNA
RNA

12
Pearson FastA format and GCG single sequence format

>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK
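
The Pearson FastA layout above is simple enough to read by hand: a header line starting with ">" followed by sequence lines. Here is a minimal reader as a sketch; the function name `read_fasta` is made up for the example, and the sample holds only the first two sequence lines of the entry above.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from Pearson/FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# First two sequence lines of the entry shown above (truncated sample).
EXAMPLE = """>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
"""

records = list(read_fasta(EXAMPLE))
identifier = records[0][0].split()[0]
print(identifier, len(records[0][1]))
```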

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding status predicted <EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) status predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) status predicted
EFHU1  Length: 462  January 14, 2002 19:49  Type: P  Check: 5308  ..
       1  MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE
     351  GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
     401  IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
     451  VTKSAQKAQK AK
