Databases in Bioinformatics - PowerPoint PPT Presentation

Slides: 57
Provided by: curt49
Transcript and Presenter's Notes

Title: Databases in Bioinformatics


1
Databases in Bioinformatics
  • Mark Pallen
  • With thanks to Arshad Khan and Jonathan Pevsner

2
Databases in Bioinformatics
  • Sequence databases
  • Sequence analysis
  • Functional genomics
  • Literature databases
  • Structural databases
  • Metabolic pathway databases
  • Specialised databases

3
The definitive source.
  • http://nar.oxfordjournals.org/content/vol34/suppl_1/index.dtl

4
DNA Sequence databases
  • Main repositories
  • GenBank (US)
  • (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
  • EMBL (Europe)
  • (http://www.ebi.ac.uk/embl/)
  • DDBJ (Japan)
  • (http://www.ddbj.nig.ac.jp/)
  • Primary databases
  • DNA sequences are identical

6
www.ncbi.nlm.nih.gov
7
  • PubMed is
  • National Library of Medicine's search service
  • >14 million citations in MEDLINE
  • links to participating online journals
  • PubMed tutorial (via side bar)

8
  • Entrez integrates
  • the scientific literature
  • DNA and protein sequence databases
  • 3D protein structure data
  • population study data sets
  • assemblies of complete genomes

9
Entrez is a search and retrieval system that
integrates NCBI databases
10
  • OMIM is
  • Online Mendelian Inheritance in Man
  • catalog of human genes and genetic disorders
  • edited by Dr. Victor McKusick, others at JHU

11
  • Taxonomy Browser is
  • browser for the major divisions of living
    organisms
  • (archaea, bacteria, eukaryota, viruses)
  • taxonomy information such as genetic codes
  • molecular data on extinct organisms

12
  • Structure site includes
  • Molecular Modelling Database (MMDB)
  • biopolymer structures obtained from
  • the Protein Data Bank (PDB)
  • Cn3D (a 3D-structure viewer)
  • vector alignment search tool (VAST)

13
How can I use PubMed at NCBI to find
literature information?
14
PubMed is the NCBI gateway to MEDLINE. MEDLINE
contains bibliographic citations and author
abstracts from over 4,000 journals published in
the United States and in 70 foreign countries.
It has 12 million records dating back to 1966.
15
MeSH is the acronym for "Medical Subject
Headings." MeSH is the list of the vocabulary
terms used for subject analysis of biomedical
literature at NLM. MeSH vocabulary is used for
indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical
literature.
18
PubMed search strategies
  • Try the tutorial on the left sidebar
  • Use boolean queries
  • lipocalin AND disease
  • Try using limits
  • Try LinkOut to find external resources

19
lipocalin AND disease (96 results): 1 AND 2
lipocalin OR disease (1.9 million results): 1 OR 2
lipocalin NOT disease (729 results): 1 NOT 2
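The three boolean operators behave like set operations on the lists of matching citations. The sketch below illustrates this in Python with made-up citation IDs, not real PubMed results; the function name `boolean_query` and the ID values are assumptions for illustration only.

```python
# Hypothetical citation IDs returned for each search term (made-up data).
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}


def boolean_query(a, b, op):
    """Combine two result sets the way a boolean search operator would."""
    if op == "AND":
        return a & b   # citations matching both terms
    if op == "OR":
        return a | b   # citations matching either term
    if op == "NOT":
        return a - b   # citations matching the first term but not the second
    raise ValueError(f"unknown operator: {op}")


for op in ("AND", "OR", "NOT"):
    print(op, sorted(boolean_query(lipocalin, disease, op)))
```

AND can only shrink a result set and OR can only grow it, which matches the pattern of the counts reported on the slide.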
20
Fulltext Literature Databases
  • Highwire
  • Google Scholar
  • Google Print
  • Useful for finding information about genes buried
    in tables in papers, invisible to PubMed

23
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of
letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
  • DNA
  • X02775: GenBank genomic DNA sequence
  • NT_030059: genomic contig
  • rs7079946: dbSNP (single nucleotide polymorphism)
  • RNA
  • N91759.1: an expressed sequence tag (1 of 170)
  • NM_006744: RefSeq DNA sequence (from a transcript)
  • Protein
  • NP_007635: RefSeq protein
  • AAC02945: GenBank protein
  • Q28369: SwissProt protein
  • 1KT7: Protein Data Bank structure record
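Some of these accession styles can be recognised from their shape alone. The Python sketch below is a rough illustration, not a complete rule set: it covers only the RefSeq prefixes NM_, NP_ and NT_, dbSNP "rs" identifiers, and four-character PDB codes, and lumps everything else together. The function name `classify_accession` and the labels are assumptions for illustration.

```python
import re

# Illustrative patterns only; real accession rules are broader than this.
PATTERNS = [
    ("RefSeq transcript", re.compile(r"^NM_\d+(\.\d+)?$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("RefSeq genomic contig", re.compile(r"^NT_\d+(\.\d+)?$")),
    ("dbSNP", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("PDB structure", re.compile(r"^\d[A-Za-z0-9]{3}$")),
]


def classify_accession(accession):
    """Return a coarse label for an accession number based on its format."""
    for label, pattern in PATTERNS:
        if pattern.match(accession):
            return label
    # GenBank and SwissProt accessions are not distinguished in this sketch.
    return "other"


for acc in ("X02775", "NT_030059", "rs7079946", "NM_006744",
            "NP_007635", "Q28369", "1KT7"):
    print(acc, "->", classify_accession(acc))
```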
24
How can I use NCBI (or other sites) to find
information about a protein or gene?
30
FASTA format
31
Graphics format
32
Question 4: How can I find information about
a particular disease?
Answer: Try OMIM
35
Sequence Databases
  • Annotated sequence databases
  • SWISS-PROT, GenBank etc
  • Usage: identifying function, retrieving
    information
  • Low-annotation sequence databases
  • EST databases, high-throughput genome sequences
  • Usage: discovery of new genes

36
General Protein Databases
  • SWISS-PROT
  • Manually curated
  • high-quality annotations, less data
  • GenPept/TREMBL
  • Translated coding sequences from GenBank/EMBL
  • Few annotations, more up to date
  • PIR
  • Phylogenetic-based annotations
  • All 3 now combining efforts to form UniProt
    (http://www.uniprot.org)

37
Low-annotation Databases
  • ESTs (Expressed Sequence Tags)
  • Low-quality sequences generated by high-volume
    sequencing of the 3' or 5' ends of cDNAs
  • High-throughput genome sequences
  • Produced by mass-sequencing of genomic DNA

38
Non-redundant Databases
  • Sequence data only: cannot be browsed, can only
    be searched using a sequence
  • Combine sequences from more than one database
  • Examples
  • NR Nucleic (GenBank+EMBL+DDBJ+PDB DNA)
  • NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein)
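One way to picture "combine sequences from more than one database" is to merge the entries and keep a single copy of each distinct sequence. The Python sketch below assumes that identical sequences are collapsed into one record that remembers all of its source identifiers; the actual rules used to build the NR sets may differ. The identifiers and sequences are invented.

```python
def merge_non_redundant(*databases):
    """Merge {identifier: sequence} dicts, keeping one entry per distinct sequence."""
    merged = {}  # sequence -> list of identifiers that share it
    for database in databases:
        for identifier, sequence in database.items():
            merged.setdefault(sequence.upper(), []).append(identifier)
    return merged


# Hypothetical entries; "db1:A" and "db2:X" carry the same sequence.
db1 = {"db1:A": "MKTAYIAK", "db1:B": "MSLLTEVE"}
db2 = {"db2:X": "mktayiak", "db2:Y": "MGDVEKGK"}

nr = merge_non_redundant(db1, db2)
for sequence, identifiers in nr.items():
    print(sequence, identifiers)
```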

39
Sequence Structure Databases
  • PDB (Protein Databank)
  • Stores 3-dimensional atomic coordinates for
    biological molecules including protein and
    nucleic acids
  • Data obtained by X-ray crystallography, NMR, or
    computer modeling
  • http://www.rcsb.org/pdb/
  • MMDB (Molecular Modeling database)
  • Over 28,000 3D macromolecular structures,
    including proteins and polynucleotides
  • (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
  • SCOP (Structural Classification of Proteins)
  • Classification of proteins according to
    structural and evolutionary relationships

40
File Formats
  • GenBank/GB, genbank flatfile format
  • NBRF format
  • EMBL, EMBL flatfile format
  • Swissprot
  • GCG, single sequence format of GCG software
  • DNAStrider, for common Mac program
  • Pearson/Fasta, a common format used by Fasta
    programs and others
  • Phylip3.2, sequential format for Phylip programs
  • Phylip, interleaved format for Phylip programs
    (v3.3, v3.4)
  • Plain/Raw, sequence data only (no name,
    document, numbering)
  • MSF, multi sequence format used by GCG software
  • PAUP's multiple sequence (NEXUS) format
  • ASN.1 format used by NCBI
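Of the formats listed, Pearson/FASTA is simple enough to read by hand: each record is a header line beginning with ">" followed by one or more lines of sequence. The Python sketch below is a minimal parser under that assumption; the function name `parse_fasta` and the two sample records are invented for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, not taken from any database.
example = """>seq1 first example
ACGTACGT
ACGT
>seq2 second example
TTGGCCAA
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```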

41
EMBL Format
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaa
http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html
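Each line of an EMBL entry starts with a two-letter code (ID, AC, DE, OS, OC and so on), and XX lines act as spacers. The Python sketch below groups lines by that code. It assumes the usual flatfile layout in which the code occupies the first two columns and the content starts at column six; the function name `parse_embl_lines` is invented, and the sample is an abbreviated excerpt of the entry on this slide.

```python
from collections import defaultdict


def parse_embl_lines(text):
    """Group the lines of an EMBL flatfile entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        content = line[5:].strip()
        # XX spacer lines (and any line with no content) are skipped.
        if code == "XX" or not content:
            continue
        fields[code].append(content)
    return dict(fields)


sample = """\
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
"""

entry = parse_embl_lines(sample)
print(entry["AC"])
print(" ".join(entry["OC"]))
```

Multi-line fields such as OC simply accumulate as a list, so joining them restores the full value.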
42
Genbank Format
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
   PUBMED   7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
   PUBMED   8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
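In the ORIGIN section each line carries the coordinate of its first base followed by blocks of ten bases. The Python sketch below strips the coordinates and joins the blocks, then tallies the bases in the way the BASE COUNT line summarises them. It uses only the first two ORIGIN lines from this slide, so its totals cover 120 bases rather than the full 5028; the function name `origin_to_sequence` is invented.

```python
from collections import Counter


def origin_to_sequence(origin_lines):
    """Join GenBank ORIGIN lines into one continuous sequence string."""
    blocks = []
    for line in origin_lines:
        fields = line.split()
        # fields[0] is the coordinate of the first base on the line; skip it.
        blocks.extend(fields[1:])
    return "".join(blocks)


origin = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]

sequence = origin_to_sequence(origin)
base_count = Counter(sequence)
print(len(sequence), dict(base_count))
```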
43
Swissprot format
http://us.expasy.org/sprot/userman.html
44
Specialized Sequence Databases
  • Focus on a specific type of sequences
  • Sequences are often modified or specially
    annotated
  • Usage depends on the database
  • Examples
  • Ribosomal RNA databases
  • Immunology databases

45
Protein domain databases
  • Pfam (http://www.sanger.ac.uk/Software/Pfam/)
  • Collection of multiple sequence alignments and
    hidden Markov models covering many common protein
    domains and families
  • SMART (a Simple Modular Architecture Research
    Tool)
  • Identification and annotation of genetically
    mobile domains and the analysis of domain
    architectures
  • (http://smart.embl-heidelberg.de/help/smart_about.shtml)
  • CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
  • Combines SMART and Pfam databases
  • Easier and quicker search

46
Sequence Motif Databases
  • Scan Prosite (http://www.expasy.org/prosite) and
    PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
  • Store conserved motifs occurring in nucleic acid
    or protein sequences
  • Motifs can be stored as consensus sequences,
    alignments, or using statistical representations
    such as residue frequency tables
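A residue frequency table can be derived directly from an alignment by counting the residues in each column. The Python sketch below shows the idea with a tiny invented alignment; it is not how Prosite or PRINTS build their entries, and the function name `residue_frequencies` is an assumption for illustration.

```python
from collections import Counter


def residue_frequencies(alignment):
    """Return one {residue: fraction} dict per column of an ungapped alignment."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    table = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        table.append({residue: n / total for residue, n in counts.items()})
    return table


# Hypothetical four-sequence alignment of a four-residue motif.
alignment = ["ACDE", "ACDF", "GCDE", "ACNE"]

for position, frequencies in enumerate(residue_frequencies(alignment), start=1):
    print(position, frequencies)
```

A column with a single residue at frequency 1.0 corresponds to a fully conserved position, which is what a consensus sequence would record.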

47
Ribosomal RNA Databases
  • RDP (Michigan State University, USA)
  • http://rdp.cme.msu.edu/html/
  • rRNA database (University of Antwerp, Belgium)
  • http://rrna.uia.ac.be/
  • ribosomal RNA sequences are pre-aligned according
    to their secondary structure
  • Usage: creating data sets for molecular
    phylogeny, especially for microbial taxonomy and
    identification

48
Immunological Sequence Databases
  • The Kabat Database of Sequences of Proteins of
    Immunological Interest
  • www.hgmp.mrc.ac.uk/Bioinformatics/Databases/kabatp-help.html
  • Sequences are classified according to antigen
    specificity, and available in pre-aligned format
  • The Immunogenetics database (IMGT)
  • http://imgt.cnusc.fr:8104/
  • Focuses on immunoglobulins, T-cell receptors and
    MHC genes

49
Genome Databases
  • Focus on one organism or group of organisms
  • Examples
  • Colibase (E. coli and related species)
  • http://colibase.bham.ac.uk/
  • GDB (human)
  • http://www.gdb.org/
  • Flybase (Drosophila)
  • http://flybase.bio.indiana.edu/
  • WormBase (C. elegans)
  • http://wormbase.org
  • AtDB (Arabidopsis)
  • http://www.arabidopsis.org
  • SGD (S. cerevisiae)
  • http://genome-www.stanford.edu/Saccharomyces/

50
Expression Databases
  • RNA expression
  • Results of microarray experiments measuring the
    change in specific mRNA content under certain
    conditions
  • ArrayExpress (EBI) and GEO (NCBI)
  • Not user friendly
  • Proteome databases
  • 2D gel electrophoresis images representing the
    protein content of a cell or tissue under
    specific conditions
  • SWISS-2DPAGE at http://us.expasy.org/ch2d/

51
Other Database Types
  • Literature
  • MEDLINE (http://ncbi.nlm.nih.gov/PubMed/)
  • HighWire (http://www.highwire.org)
  • Variation
  • dbSNP (http://ncbi.nlm.nih.gov/SNP/)
  • HGBase (http://hgbase.interactiva.de)
  • Metabolic pathways
  • KEGG (http://kegg.genome.ad.jp/kegg/)
  • WIT (http://wit.mcs.anl.gov/WIT2)
  • Organisms and nomenclature
  • Taxonomies (e.g. http://ncbi.nlm.nih.gov/Taxonomy/)
  • Mendel (http://mbclserver.rutgers.edu/CPGN)

52
Methods for Accessing Data
  • local installation
  • screen scraping
  • BioPerl
  • FTP sites
  • DAS

53
Local Installations
  • SRS
  • Need to obtain license from Lion Bioscience
  • Download data from FTP sites
  • Ensembl
  • "framework to organise biology around the
    sequences of large genomes"
  • www.ensembl.org

54
Screen Scraping
  • URL spoofing
  • construction of URLs that replicate the query
  • html parsing
  • extraction of results from html pages returned by
    query
  • Requirements
  • html module
  • knowledge of query mechanism
  • Method NOT advocated by most data providers
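The two steps named above, building a URL that replicates a query and pulling results out of the returned HTML, can be sketched with the Python standard library. Everything specific here is hypothetical: the base URL on example.org, the parameter names `db` and `term`, and the assumption that results arrive in table cells. No network request is made; the HTML is a hand-written stand-in for a results page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_query_url(base_url, **params):
    """Construct a URL that replicates a web-form query (step 1)."""
    return base_url + "?" + urlencode(params)


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell in an HTML page (step 2)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


# Hypothetical service and parameters, for illustration only.
url = build_query_url("http://example.org/search", db="protein", term="lipocalin")
print(url)

# Hand-written stand-in for the page such a query might return.
page = "<table><tr><td>RBP4</td><td>retinol-binding protein</td></tr></table>"
extractor = CellExtractor()
extractor.feed(page)
print(extractor.cells)
```

Because the parser depends on the page layout, any change to the provider's HTML breaks it, which is one reason data providers discourage this approach.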

55
BioPerl
  • BioPerl is a collection of modules that
    facilitates the development of Perl scripts for
    bioinformatics applications.
  • www.bioperl.org

56
ReadSeq
  • Converts input DNA/AA sequence to specified
    format
  • Usage:
  • readseq my1st.seq my2nd.seq -all -format=genbank
    -output=my.gb
  • Online Manual
  • http://www.psc.edu/general/software/packages/readseq/manual.html