Biological Databases - PowerPoint PPT Presentation

Slides: 38
Provided by: chucks96
1
Biological Databases
  • What types of data are available?
  • What is a database?
  • What are Genbank and Entrez?
  • What does a typical entry look like?
  • How does one use the database?

BIO520 Bioinformatics Jim Lund
2
Biological Data
  • Central Dogma-o-centric
  • Genomic DNA sequence
  • mRNA/cDNA sequence
  • Protein sequence
  • Protein 3D structure
  • Literature (Function)

3
Biological Data
  • Genomic DNA sequence (complete)
  • mRNA/cDNA sequence
  • Gene expression data (NEW)
  • Microarrays, SAGE
  • Expression catalogs
  • Protein sequence
  • Protein interaction/complex data (NEW)
  • Protein 3D structure
  • Literature (Function)
  • Organism databases (NEW)
  • Annotation and classification projects (NEW)

4
What is a Biological Database?
  • An organized body of persistent data and
    associated computer software for updating,
    querying, and retrieving data records.
  • Collection of records and files
  • Organized for a particular purpose
  • The database is separate from the interface and
    can have several interfaces.
  • NCBI Protein can be searched by protein name or
    using BLAST (Basic Local Alignment Search Tool).

5
Common database features
  • Relational Databases
  • Tables
  • Relationships between tables
  • Version Control
  • Consistency enforcement
  • Multiauthor/multiuser with security

6
BIO520 Student Database
Table: 2005

  Name   ID    Grade
  Amy    123   A
  Joe    456   B
  Sue    789   C

Record: one row of the table (e.g., Amy 123 A)
Attribute: one column of the table (e.g., Grade)
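The student table can be sketched with Python's built-in sqlite3 module to make the terms concrete. This is only an illustration: the table name students_2005, the column types, and the helper grade_of are assumptions, not part of the course material.

```python
import sqlite3

# In-memory relational database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students_2005 (name TEXT, id INTEGER PRIMARY KEY, grade TEXT)"
)
# Each tuple is one record; each position in the tuple is one attribute.
conn.executemany(
    "INSERT INTO students_2005 (name, id, grade) VALUES (?, ?, ?)",
    [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")],
)


def grade_of(student_id):
    """Return the Grade attribute of the record with this ID, or None."""
    row = conn.execute(
        "SELECT grade FROM students_2005 WHERE id = ?", (student_id,)
    ).fetchone()
    return row[0] if row else None


print(grade_of(456))
```

The query interface (here, SQL through grade_of) is separate from the stored table, which is the same separation of database and interface described on slide 4.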
7
Genbank Entry
LOCUS       BC005255     495 bp    mRNA    linear   PRI 23-JUN-2006
DEFINITION  Homo sapiens insulin, mRNA (cDNA clone IMAGE:3950204),
            complete cds.
ACCESSION   BC005255
VERSION     BC005255.1  GI:13528923
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates;
            Haplorrhini; Catarrhini; Hominidae; Homo.
FEATURES             Location/Qualifiers
     source          1..495
                     /organism="Homo sapiens"
     gene            1..495
                     /gene="INS"
                     /db_xref="GeneID:3630"
     CDS             60..392
                     /gene="INS"
                     /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCG
                     ERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSL
                     YQLENYCN"
ORIGIN
        1 agccctccag gacaggctgc atcagaagag gccatcaagc agatcactgt ccttctgcca
      ...
      421 ccgcctcctg caccgagaga gatggaataa agcccttgaa ccaacaaaaa aaaaaaaaaa
      481 aaaaaaaaaa aaaaa
//
8
The CORE: DDBJ, EMBL, and Genbank
9
Genbank DNA Sequence Database
  • Genbank, EMBL, and DDBJ mirror one another and
    exchange sequence records.
  • Primary vs Secondary Databases
  • nr (non-redundant database)
  • Primary vs secondary records
  • Sequence vs inferred property (coding region)
  • Format vs content

10
Genbank Entry
LOCUS       PCU30791     1234 bp    mRNA            PLN       31-MAY-1996
DEFINITION  Pneumocystis carinii carinii form 6 guanine nucleotide binding
            protein alpha subunit (pcg1) mRNA, complete cds.
ACCESSION   U30791
NID         g1345098
VERSION     U30791.1  GI:1345098
Callouts: Unique ID (ACCESSION), Version Control (VERSION)
11
Content-Taxonomy
SOURCE      Pneumocystis carinii f. sp. carinii.
  ORGANISM  Pneumocystis carinii f. sp. carinii
            Eukaryota; Fungi; Ascomycota; Archiascomycetes;
            Pneumocystidaceae; Pneumocystis.
12
Reference
REFERENCE   1  (bases 1 to 1234)
  AUTHORS   Smulian,A.G., Ryan,M., Staben,C. and Cushion,M.
  TITLE     Signal transduction in Pneumocystis carinii: characterization
            of the genes (pcg1) encoding the alpha subunit of the G protein
            (PCG1) of Pneumocystis carinii carinii and Pneumocystis carinii
            ratti
  JOURNAL   Infect. Immun. 64 (3), 691-701 (1996)
  PUBMED    96186460
  • Unique cross-reference
  • Can be >1 reference

13
Features
FEATURES             Location/Qualifiers
     source          1..1234
                     /organism="Pneumocystis carinii f. sp. carinii"
                     /strain="Form 6"
                     /note="450 kb chromosome"
                     /db_xref="taxon:38081"
     5'UTR           1..90
     gene            91..1155
                     /gene="pcg1"
14
CDS
     CDS             91..1155
                     /gene="pcg1"
                     /note="G-protein alpha subunit"
                     /codon_start=1
                     /product="guanosine nucleotide binding protein alpha
                     subunit"
                     /protein_id="AAC49295.1"
                     /db_xref="PID:g1345099"
                     /db_xref="GI:1345099"
                     /translation="MGCCFSATYNQDTLRSKEIESYLRQEQEHACHEAKILLLGAGES
                     ...
INFERRED (the coding region and its translation are inferred properties, not sequence data)
15
DNA
BASE COUNT      421 a    171 c    195 g    447 t
ORIGIN
        1 tgaattctaa attttatatt
      ...
     1201 tattttttta tgctccagat aaaa
//
16
Genbank entries
  • Combination of required (LOCUS, SOURCE) and
    optional fields.
  • The entry is hierarchical; some fields contain
    subfields, e.g. REFERENCE -> AUTHORS.
  • Some fields can appear multiple times (REFERENCE,
    /gene).
  • Some fields are numerical, others are text. Some
    fields contain free text; others use a controlled
    vocabulary or a database ID.
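The field/subfield hierarchy can be illustrated with a small parser for the header part of a flat-file entry. This is a minimal sketch, assuming the usual fixed-width layout (keyword in the first 12 columns, subkeywords indented, continuation lines blank in those columns); the function name parse_header and the abbreviated SAMPLE record are illustrative, and the FEATURES table and sequence are deliberately not handled.

```python
def parse_header(text):
    """Parse GenBank header lines into (keyword, value, subfields) tuples.

    subfields is a list of (subkeyword, value) pairs. A list is returned
    rather than a dict because fields such as REFERENCE may repeat.
    """
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if line[0] != " ":            # top-level keyword, e.g. LOCUS
            fields.append([key, value, []])
        elif key:                     # indented subkeyword, e.g. AUTHORS
            fields[-1][2].append([key, value])
        else:                         # continuation of the previous value
            target = fields[-1][2][-1] if fields[-1][2] else fields[-1]
            target[1] = (target[1] + " " + value).strip()
    return [(k, v, [tuple(s) for s in subs]) for k, v, subs in fields]


SAMPLE = """\
LOCUS       PCU30791     1234 bp    mRNA            PLN       31-MAY-1996
DEFINITION  Pneumocystis carinii carinii form 6 guanine nucleotide binding
            protein alpha subunit (pcg1) mRNA, complete cds.
ACCESSION   U30791
VERSION     U30791.1  GI:1345098
SOURCE      Pneumocystis carinii f. sp. carinii.
  ORGANISM  Pneumocystis carinii f. sp. carinii
            Eukaryota; Fungi; Ascomycota; Archiascomycetes;
            Pneumocystidaceae; Pneumocystis.
REFERENCE   1  (bases 1 to 1234)
  AUTHORS   Smulian,A.G., Ryan,M., Staben,C. and Cushion,M.
  JOURNAL   Infect. Immun. 64 (3), 691-701 (1996)
"""

for keyword, value, subfields in parse_header(SAMPLE):
    print(keyword, "->", [name for name, _ in subfields])
```

For real work, an established parser is the safer choice; the sketch only shows why the format is described as hierarchical.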

17
Other Genbank Formats
  • FASTA
  • Simple, little annotation information
  • Easy to use
  • Common denominator format
  • ASN.1
  • Computer friendly, human unfriendly
  • XML, INSDSeqXML, TinySeqXML
  • Graph (graphical map of seq features)
  • and more

18
DNA Sequence Files: Common formats
  • Genbank (used by VectorNTI)
  • FASTA
  • GCG
  • Accelrys GCG package
  • formerly known as the GCG Wisconsin Package
  • (GCG = Genetics Computer Group)
  • Many others!

19
FASTA
One annotation line only!
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATAC
TAGATTTATTCCTGGAAACTTAAATTAGTTATTTTAAGTTATGGGATGT
TGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
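Because FASTA has a single annotation line followed by sequence lines, it is easy to read. The following is a minimal sketch, assuming an NCBI-style header whose fields are separated by "|" characters; read_fasta and EXAMPLE are illustrative names, and the sequence is the truncated excerpt from the slide, not the full 1234 bp record.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # annotation line starts a record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:                             # sequence lines are concatenated
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


EXAMPLE = """\
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATAC
TAGATTTATTCCTGGAAACTTAAATTAGTTATTTTAAGTTATGGGATGT
TGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
"""

records = list(read_fasta(EXAMPLE))
header, seq = records[0]
# The header is the only annotation: GI number, source database,
# accession.version, and locus name.
gi_tag, gi, db, accession, locus = header.split("|")
print(accession, locus, len(seq))
```

Everything else in the Genbank entry (taxonomy, references, features) has no place in this format.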
20
Submitting sequences to Genbank
  • Sequin
  • Stand-alone sequence submission tool.
  • BankIt
  • Web based sequence submission.

21
Genbank is an ARCHIVE
  • The literature and secondary databases are the
    knowledge sources.
  • There are many additional NCBI annotation
    databases

22
NCBI annotation databases!
  • Genbank -> RefSeq (Single sequence for each gene)
  • Entrez Gene (Gene-based links to annotation
    sources).
  • HomoloGene (Homologs)
  • OMIM
  • Conserved domains, 3D domains
  • GEO (Gene expression datasets)
  • DNA, protein, 3D structures
  • Interaction data
  • Links to other databases!
  • NCBI Genomes
  • NCBI Map viewer

23
Accessing/Editing DNA files
  • Find DNA: Entrez
  • Downloading files
  • Format Conversion
  • Sequence viewing/editing

24
Entrez
  • (Relational database manager)
  • Database searching/browsing
  • Example: Pneumocystis G-proteins
  • PCR a cDNA to express in E. coli
  • Read about it and related genes
  • Check similarity to related G-proteins
  • View the 3D structure??
  • http://www.ncbi.nlm.nih.gov/Entrez/
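Entrez records can also be retrieved by script rather than through the web page. The sketch below is an assumption-laden illustration: the slides do not mention NCBI's E-utilities, so the efetch endpoint and the db, id, rettype, and retmode parameters are brought in from outside the lecture, and efetch_url is a hypothetical helper. The code only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# E-utilities efetch endpoint (assumed here; not named in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build a URL for one nucleotide record.

    rettype "gb" requests the Genbank flat file, "fasta" the FASTA form.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_url("U30791.1", "fasta"))
```

The same accession can therefore be requested in either format, which again shows the database being separate from any one interface.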

25
Entrez Neighbors-Literature
[Diagram: an Article record and its neighbors. Nodes: DNA, Protein, Structure, Genome, Popset, Article. Link labels: citation; keyword, authors.]
26
Entrez Neighbors-DNA
[Diagram: a DNA record and its neighbors. Nodes: Literature, DNA, Protein. Link labels: citation, BLASTN, encoding.]
27
Entrez Neighbors-Protein
[Diagram: a Protein record and its neighbors. Nodes: 3D Structure, Literature, Protein, DNA. Link labels: citation, BLASTP, encoding.]
28
Entrez Neighbors-Structure
[Diagram: a Structure record and its neighbors. Nodes: Protein, Literature, Structure. Link labels: citation, VAST.]
29
Nucleic Acid Manipulations
  • e.g., VectorNTI

30
File Conversion
  • Readseq
  • Download program
  • http://iubio.bio.indiana.edu/soft/molbio/readseq
  • Use online
  • http://www.ebi.ac.uk/cgi-bin/readseq.cgi
  • http://searchlauncher.bcm.tmc.edu/seq-util/readseq.html
  • VectorNTI
  • Other utilities
  • Readseq ---->

Beware: Information Loss
31
Reverse Complementing
5'-GAATCA-3'
5'-TGATTC-3' (reverse complement), NOT 5'-ACTAAG-3' (reversed only)
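A minimal sketch of reverse complementing, using the sequence from the slide. The names COMPLEMENT and reverse_complement are illustrative, and only the four unambiguous bases are handled; ambiguity codes such as N pass through unchanged.

```python
# Map each base to its Watson-Crick partner, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse, so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("GAATCA"))
```

Both steps are needed: complementing alone or reversing alone gives a strand that does not pair with the original.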
32
Sequence Statistics
  • Nucleotide frequencies (di, tri)
  • UV Absorbance
  • MW
  • Tm
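Two of these statistics can be sketched in a few lines. The Tm estimate below uses the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) in degrees C, which is an assumption brought in for illustration (the slides do not say which formula is intended) and is only meaningful for short oligonucleotides. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers: k=1 gives base counts, k=2 dinucleotides."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def wallace_tm(oligo):
    """Rough melting temperature (deg C) for a short oligo: 2(A+T) + 4(G+C)."""
    counts = Counter(oligo.upper())
    return 2 * (counts["A"] + counts["T"]) + 4 * (counts["G"] + counts["C"])


print(kmer_counts("GAATTC", 1))
print(kmer_counts("GAATTC", 2))
print(wallace_tm("GAATTC"))
```

Sequence analysis packages typically use more elaborate Tm models; the rule of thumb only shows how base composition drives the value.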

33
Restriction Map
  • Linear vs Circular
  • Enzyme sets
  • Which enzymes, where they cut.
  • Gel simulation
  • Gel-to-map MUCH harder!!
  • Useful for
  • Cloning
  • Southern blots
  • Specialized mol bio techniques
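The map-to-gel direction can be sketched as follows: find where each enzyme cuts, then compute fragment lengths for a linear or a circular molecule. The enzyme table is a small illustrative set (recognition sites and cut offsets for EcoRI, BamHI, and HindIII), and the function names are hypothetical. As a simplification, sites that span the origin of a circular molecule are not detected.

```python
# Recognition site and top-strand cut offset within the site.
# EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def cut_positions(seq, site, offset):
    """Return 0-based positions where the top strand is cut."""
    seq = seq.upper()
    cuts, start = [], seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)
    return cuts


def fragment_lengths(seq_len, cuts, circular=False):
    """Fragment sizes after a complete digest (what a gel would show)."""
    cuts = sorted(cuts)
    if not cuts:
        return [seq_len]
    bounds = [0] + cuts + [seq_len]
    frags = [b - a for a, b in zip(bounds, bounds[1:])]
    if circular:
        # The first and last pieces are one fragment joined across the origin.
        frags = frags[1:-1] + [frags[0] + frags[-1]]
    return frags


seq = "AAGAATTCAAAAGAATTCAA"
cuts = cut_positions(seq, *ENZYMES["EcoRI"])
print(cuts)
print(fragment_lengths(len(seq), cuts))
print(fragment_lengths(len(seq), cuts, circular=True))
```

Going from fragment sizes back to a map is much harder because the sizes alone do not record the order of the fragments.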

34
Translation/ORFs
  • Translation table
  • Standard vs non-standard
  • Frame (1,2,3,4,5,6)
  • Segmental translation (exon-intron)
  • Primary translation vs mature polypeptide
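Six-frame translation can be sketched with the standard genetic code. Assumptions in this sketch: only the standard table is built (non-standard tables would need a different AMINO string), frames 1 to 3 are taken on the given strand and 4 to 6 on its reverse complement, incomplete trailing codons are dropped, and codons containing other characters become "X". All names are illustrative.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate in frame 1; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return {frame number: protein} for frames 1-3 (forward) and 4-6 (reverse)."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[offset + 4] = translate(rc[offset:])
    return frames


# First 18 bases of the pcg1 coding region (positions 91-108 of U30791).
for frame, protein in six_frames("ATGGGATGTTGTTTTTCT").items():
    print(frame, protein)
```

Frame 1 of this excerpt reproduces the start of the /translation qualifier shown in the CDS feature.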

35
Sequence File Editing
  • VectorNTI
  • -Windows editor
  • (e.g., Word: save as TEXT)
  • Text editor
  • Notepad, Simpletext
  • Wordprocessor
  • vi

[Font comparison: the strings MWGTCC and IIIIII shown in a proportional font and in a nonproportional font]
Nonproportional fonts (courier, monospaced)
36
Plasmids-challenges
  • Parent/child database
  • Dynamic updating
  • Known/unknown segments
  • Heuristic constructions (PCR, Restriction
    digests)
  • No uniform nomenclature
  • No uniform datastructure
  • No public database

VECTOR NTI HELPS
37
Primer design program: Primer3
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi