Title: Day 3. Databases
1Bioinformatics
- Day 3. Databases
- BLAST
- Database locations
- Nucleotide databases
- Protein databases
- Transfac database
- Structure databases
- Bibliographic
3Multiple flavours of BLAST
4Databases to search with BLAST
- A truncated list of choices
5The PSI-BLAST algorithm
- Probe each sequence in database for local
regions of similarity
- Collect significant hits
- Construct multiple sequence alignment table
between query sequence and significant local
matches
- Form a profile from the multiple alignment
- Re-probe the database with the profile, looking
only for local matches
- Retain statistically significant hits
- Go back to step 2 (collect significant hits) and repeat until the
result no longer changes (iteration)
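The loop above can be illustrated with a toy sketch. It is not PSI-BLAST itself: it assumes ungapped matches of fixed length, scores each window by summed log-odds against a uniform background, and uses a fixed score cutoff in place of real significance statistics. The names `build_profile`, `best_window`, `iterative_search` and `threshold`, and the small database, are all hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids only


def build_profile(segments, pseudocount=1.0):
    """Turn equal-length aligned segments into per-column log-odds scores."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for col in range(len(segments[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seg in segments:
            counts[seg[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log((counts[aa] / total) / background) for aa in ALPHABET}
        )
    return profile


def best_window(profile, sequence):
    """Return (score, window) for the best ungapped local match."""
    width = len(profile)
    best = (float("-inf"), None)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(profile[i][aa] for i, aa in enumerate(window))
        if score > best[0]:
            best = (score, window)
    return best


def iterative_search(query, database, threshold, max_rounds=10):
    """Re-probe the database with a growing profile until the hit set is stable."""
    segments = [query]
    hits = {}
    for _ in range(max_rounds):
        profile = build_profile(segments)
        new_hits = {}
        for name, seq in database.items():
            score, window = best_window(profile, seq)
            if window is not None and score >= threshold:
                new_hits[name] = window
        converged = set(new_hits) == set(hits)
        hits = new_hits
        if converged:
            break
        # Rebuild the alignment table from the query plus the current hits.
        segments = [query] + [hits[name] for name in sorted(hits)]
    return hits


# Made-up database for illustration only.
database = {
    "s1": "GGACDEFGG",
    "s2": "ACDEY",
    "s3": "ACDWY",
    "s4": "WWWWWWWW",
}
hits = iterative_search("ACDEF", database, threshold=2.0)
```

With this made-up database, s3 scores below the cutoff against the query alone and is picked up only once the profile includes the first-round hits, which is the point of iterating.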
6Databases
- Nucleic acid sequences, including whole-genome
projects
- Amino acid sequences of proteins
- Protein and nucleic acid structures
- Small-molecule crystal structures
- Protein functions
- Expression patterns of genes
- Publications
7Database locations
- National Center for Biotechnology Information
(NCBI)
- European Bioinformatics Institute (EBI)
- DNA Data Bank of Japan (DDBJ)
8Nucleic acid sequence databases
9Database entries
LOCUS       BTBPTIG                 3998 bp    DNA     linear   MAM 17-NOV-2004
DEFINITION  Bovine pancreatic trypsin inhibitor (BPTI) gene.
ACCESSION   X03365 K00966
VERSION     X03365.1  GI:142
KEYWORDS    Alu-like repetitive sequence; protease inhibitor; trypsin
            inhibitor.
SOURCE      Bos taurus (cattle)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
            Pecora; Bovidae; Bovinae; Bos.
REFERENCE   1
  AUTHORS   Anderson,S. and Kingston,I.B.
  TITLE     Isolation of a genomic clone for bovine pancreatic trypsin
            inhibitor by using a unique-sequence synthetic DNA probe
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 80 (22), 6838-6842 (1983)
   PUBMED   6580617
REFERENCE   2  (bases 1 to 3998)
  AUTHORS   Kingston,I.B. and Anderson,S.
  TITLE     Sequences encoding two trypsin inhibitors occur in strikingly
            similar genomic environments
  JOURNAL   Biochem. J. 233 (2), 443-450 (1986)
   PUBMED   2420326
COMMENT     Data kindly reviewed (08-DEC-1987) by Kingston I.B.
FEATURES             Location/Qualifiers
     source          1..3998
                     /organism="Bos taurus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9913"
     misc_feature    795..800
                     /note="pot. polyA signal"
     CDS             2685
                     /note="unnamed protein product; trypsin inhibitor
                     (aa 1-58)"
                     /codon_start=1
                     /protein_id="CAA27063.1"
                     /db_xref="GI:1364184"
10The sequence
ORIGIN
        1 aattctgata atgcagagaa ctggtaagga gttctgattg ttctgcttga ttaaatgggt
       61 tgtaacagga tagtgtcttg tcctgatcct agcattcata tggtgtgtgt tctggggcaa
      121 gtcatctgca gtttcttcac ctgaacaggg ggaccaggtt acatgagttt
          [sequence deleted]
          actttggggt gtgttatttc cctgaatt
//
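The ORIGIN block and the FASTA download hold the same bases in different layouts. A minimal sketch of the conversion, assuming the block contains only position numbers, blank-separated groups of bases and the `//` terminator (any editorial marker such as "sequence deleted" must be removed first). The function names `origin_to_sequence` and `to_fasta` are hypothetical.

```python
import re


def origin_to_sequence(origin_block):
    """Drop the ORIGIN keyword, position numbers, spaces and the // terminator."""
    parts = []
    for line in origin_block.splitlines():
        line = line.strip()
        if not line or line.startswith("ORIGIN") or line.startswith("//"):
            continue
        # Keep letters only: this removes the leading base count and the gaps.
        parts.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(parts).upper()


def to_fasta(header, sequence, width=70):
    """Format a sequence as FASTA text with fixed-width sequence lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# First two ORIGIN lines of the X03365 entry shown on the slide.
origin = """ORIGIN
        1 aattctgata atgcagagaa ctggtaagga gttctgattg ttctgcttga ttaaatgggt
       61 tgtaacagga tagtgtcttg tcctgatcct agcattcata tggtgtgtgt tctggggcaa
//
"""
sequence = origin_to_sequence(origin)
fasta_text = to_fasta("X03365.1 (first 120 bases)", sequence)
```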
11Download sequence in FASTA format
>gi|142|emb|X03365.1|BTBPTIG Bovine pancreatic trypsin inhibitor (BPTI) gene
AATTCTGATAATGCAGAGAACTGGTAAGGAGTTCTGATTGTTCTGCTTGATTAAATGGGTTGTAACAGGA
TAGTGTCTTGTCCTGATCCTAGCATTCATATGGTGTGTGTTCTGGGGCAAGTCATCTGCAGTTTCTTCAC
CTGAACAGGGGGACCAGGTTACATGAGTTTCTTAAAAGATTACCAGTCATGAGTATGAAGAGTTTACACT
TTCCTGATCAATGACGTCCATTTCCCATCAAAATATTTTAGTCCAAAAGACTCATCTATCTAATGTAGAT
CATTTTCTCACCACCCCTCTAAAAAATTTATCTTTCAGATATGATCATTTCTCTATTATGAAATTAATCA
GAGAGTTGAGTGACAGCTGAGTGTCTTCCCTCCAAAGGCAACTGCAGGAAGAGCAAGAAATGCAATACTT
TTCTATGAGTTTGCTCGTGGGGCCAAGACTGCTTTTTCCAGGCTGGTACAATAGTAATCAAATCTCAAAG
ATATTCTTCTTTCCTCCTGGCCAGACTATTATTTTATTTTCCTATCAAGATATAGAAAGTTAGAAGTAGA
CTCATAATTATATAGGCAGGCCTCATCATCAAATAGACTAACAAGAATTTTATTTTATCTGCCTTTTCAA
TGACTGTGCACTTGGCATGAGGATGAAATGGGAGATTTATTCCCTTGATAAATATTCATGAAATACTTAT
GCTTTTTGTCCCTAAAAAGCATATTTCTTGATATAGGAAAACAGCTGTAAACAAAAGGTAGTAAAATAAT
[deleted]
GTAGAATTTCCATCATCGAGTTTTCAGCTCAGTGGTGGGAGAGGTCTTTTCATGAACGAAACCTCCTCCT
CACATTGATTTGAAGGTCTGTGGCTTCAAAGAGTCTGGCCTTATCTTTAAATAAATTCATATTTTAATTA
AACTAACTGGAGTGGATTGTGTTGTTTGCAACTAAGAACCTTAACCCATAGGTTCCATGGAAACGGTGGT
CTTTCTCATTTTATGCAGATGGGTGGGCAGCTCTCCATCACCTCTCCTCAGACTCAGCCCTACCAAGTAG
AAGGAGCCAACCCCTTACACTGACATCTACCTCTTATGGCCGTGCCAGTGTACATGAAAAACTGGATGAG
AGACACCTCAACAAGAAAACTTTTGTCCTTCACTTCTTGGGCCAGGTCAAACTTTGGGGTGTGTTATTTC
CCTGAATT
12FASTA file format
- A sequence in FASTA format
- begins with a single-line description
- followed by lines of sequence data
- description line begins with the greater-than
(">") symbol
- lines of text should be shorter than 80
characters
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
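Reading the format is simple enough to sketch in a few lines. This is an illustrative reader, not a replacement for an established library; it assumes each record starts with a line beginning with ">" and that every following line up to the next ">" is sequence data. The name `parse_fasta` and the second record in the example are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
>example2 made-up second record
ACDEFG
HIK
"""
records = parse_fasta(example)
```

The sequence lines are joined without separators, so the line width of the input file does not affect the result.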
13Protein sequence databases
- SWISS-PROT (Swiss Institute of Bioinformatics and EMBL)
- PIR (Georgetown University)
- MIPS (Munich Information Center for Protein Sequences)
- Explore and see what they offer
14SWISS-PROT
15Protein Information Resource (PIR)
16The Transfac database: transcription factors
17The Transfac SITE file entry
AC  Accession no.
ID  Identifier
DT  Date; author
TY  Sequence type
DE  Description (gene or gene product); GENE accession no.
RE  Gene region (e. g. promoter, enhancer)
SQ  Sequence of the regulatory element
EL  Denomination of the element
SF  First position of factor binding site
ST  Last position of factor binding site
S1  Definition of first position (if not transcription start site)
BF  Binding factor (FACTOR accession no.; name; quality; biological species)
MX  Deduced matrix (MATRIX accession no.; identifier)
OS  Organism species
OC  Organism classification
SO  Factor source (TRANSFAC CELL accession no.; name)
MM  Method
CC  Comments
DR  External databases (EPD, Flybase, TRANSCompel, TRANSPRO, PathoDB)
DR  EMBL accession no.; identifier (first:last position of the TRANSFAC
    sequence element)
RX  MEDLINE ID
RN  Reference no.
RA  Reference authors
RT  Reference title
RL  Reference data
//
18The FACTOR file entries
AC  Accession no.
ID  Identifier
DT  Date; author
FA  Factor name
SY  Synonyms
OS  Species
OC  Biological classification (taxonomy)
GE  Encoding gene
HO  Homologs (suggested)
CL  Classification (class accession no.; class identifier; decimal
    classification number.)
SZ  Size (length (number of amino acids); calculated molecular mass in kDa;
    experimental molecular mass (or range) in kDa (experimental method); Ref)
SQ  Sequence
SC  Sequence comment, i.e. source of the protein sequence
FT  Feature table (1st position; last position; feature)
SF  Structural features
CP  Cell specificity (positive)
CN  Cell specificity (negative)
EX  Expression pattern: organ, cell name, system, developmental stage;
    relative level of expression (very high, high, medium, low, very low,
    detectable or none); detection method; molecule type detected, i.e. RNA
    or protein; reference
FF  Functional features
IN  Interacting factors (factor accession no.; factor name; biological
    species.)
MX  Matrix (MATRIX accession no.; identifier)
BS  Binding SITE accession no.; SITE ID; Quality: N; short description,
    GENE accession no.; biological species
[deleted]
//
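Both file layouts follow the same pattern: a two-character line code, a value, and `//` closing each entry. A minimal reader for that pattern might look like the sketch below. It assumes the code is the first blank-delimited token on each line and that a code may repeat within an entry (as DR does above). The function name `parse_flat_records` is hypothetical, and the sample entry is made up for illustration; it is not a real Transfac record.

```python
from collections import defaultdict


def parse_flat_records(text):
    """Split line-coded flat-file text into a list of {code: [values]} dicts."""
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith("//"):
            # The terminator closes the current entry.
            if current:
                records.append(dict(current))
            current = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        current[code].append(value.strip())
    if current:
        records.append(dict(current))
    return records


# Made-up entry using the SITE field codes listed above.
sample = """AC  R00000
ID  EXAMPLE$SITE
SQ  TATAAA
BF  T00000; EXAMPLE-FACTOR-1
BF  T00001; EXAMPLE-FACTOR-2
//
"""
entries = parse_flat_records(sample)
```

Storing each value in a list keeps repeated codes such as BF or DR from overwriting one another.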
19Searching Transfac
[Figure: Transfac search output. The query sequence is shown against a position ruler (10 to 140), with predicted binding sites for TBP, GCN4, GAL4, BAF1, AP-1, DBF-A, GCR1 and RAF labelled above the sequence and the site positions listed below it.]
  1 TTCTCATGTT TGACGAGCTT ATCATCGATA AGCTTTAATG CGGTAGTTTA TCACAGTTAA ATTGCTAACG
 71 CAGTCAGGCA CCGTGTATGA AATCTAACAA TGCGCTCATC GTCATCCTCG GCACCGTCAC CCTGGATGCT
Site positions listed: 4, 11, 19, 27, 48 (bases 1 to 70); 71, 73, 109, 111, 113, 115, 118, 125, 126, 134 (bases 71 to 140).
20PDB structure file
HEADER    HYDROLASE                               05-MAY-00   1EY0
TITLE     STRUCTURE OF WILD-TYPE S. NUCLEASE AT 1.6 A RESOLUTION
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: STAPHYLOCOCCAL NUCLEASE;
COMPND   3 CHAIN: A;
COMPND   4 EC: 3.1.31.1;
COMPND   5 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: STAPHYLOCOCCUS AUREUS;
SOURCE   3 ORGANISM_COMMON: BACTERIA;
SOURCE   4 STRAIN: FOGGI;
SOURCE   5 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE   6 EXPRESSION_SYSTEM_COMMON: BACTERIA
KEYWDS    HYDROLASE
EXPDTA    X-RAY DIFFRACTION
AUTHOR    J.CHEN,Z.LU,J.SAKON,W.E.STITES
REVDAT   1   18-OCT-00 1EY0    0
JRNL        AUTH   J.CHEN,Z.LU,J.SAKON,W.E.STITES
JRNL        TITL   INCREASING THE THERMOSTABILITY OF STAPHYLOCOCCAL
JRNL        TITL 2 NUCLEASE: IMPLICATIONS FOR THE ORIGIN OF PROTEIN
JRNL        TITL 3 THERMOSTABILITY
JRNL        REF    J.MOL.BIOL.                   V. 303   125 2000
21PDB structure file
HELIX    1   1 TYR A   54  ASN A   68  1                                  15
HELIX    2   2 VAL A   99  GLN A  106  1                                   8
HELIX    3   3 HIS A  121  GLU A  135  1                                  15
HELIX    4   4 LEU A  137  SER A  141  5                                   5
SHEET    1   A 7 LYS A  97  MET A  98  0
SHEET    2   A 7 GLY A  88  ALA A  94 -1  N  ALA A  94   O  LYS A  97
SHEET    3   A 7 ILE A  72  PHE A  76 -1  O  GLU A  73   N  TYR A  93
SHEET    4   A 7 LYS A   9  ALA A  17 -1  O  GLU A  10   N  VAL A  74
SHEET    5   A 7 THR A  22  TYR A  27 -1  N  LYS A  24   O  LYS A  16
SHEET    6   A 7 GLN A  30  LEU A  36 -1  O  GLN A  30   N  TYR A  27
SHEET    7   A 7 GLY A  88  ALA A  94  1  O  GLY A  88   N  ARG A  35
SHEET    1   B 2 VAL A  39  ASP A  40  0
SHEET    2   B 2 LYS A 110  VAL A 111 -1  O  LYS A 110   N  ASP A  40
ATOM      1  N   LYS A   6      63.582  22.010  -9.585  1.00 67.45           N
ATOM      2  CA  LYS A   6      63.469  22.366  -8.175  1.00 98.38           C
ATOM      3  C   LYS A   6      62.720  23.683  -7.978  1.00 57.15           C
ATOM      4  O   LYS A   6      63.134  24.746  -8.431  1.00 61.37           O
ATOM      5  CB  LYS A   6      64.851  22.447  -7.522  1.00 60.18           C
ATOM      6  CG  LYS A   6      65.999  22.092  -8.457  1.00151.86           C
ATOM      7  CD  LYS A   6      66.725  20.834  -8.012  1.00157.91           C
ATOM      8  CE  LYS A   6      68.170  20.803  -8.483  1.00156.93           C
ATOM      9  NZ  LYS A   6      68.991  19.821  -7.717  1.00 62.55           N
END
22PDB file field codes
ATOM      1  N   LYS A   6      63.582  22.010  -9.585  1.00 67.45           N

---------------------------------------------------------------------------
Field  Column   FORTRAN
No.    range    format   Description
---------------------------------------------------------------------------
 1.     1 -  6  A6       Record ID (eg ATOM, HETATM)
 2.     7 - 11  I5       Atom serial number
 -     12 - 12  1X       Blank
 3.    13 - 16  A4       Atom name (eg " CA " , " ND1")
 4.    17 - 17  A1       Alternative location code (if any)
 5.    18 - 20  A3       Standard 3-letter amino acid code for residue
 -     21 - 21  1X       Blank
 6.    22 - 22  A1       Chain identifier code
 7.    23 - 26  I4       Residue sequence number
 8.    27 - 27  A1       Insertion code (if any)
 -     28 - 30  3X       Blank
 9.    31 - 38  F8.3     Atom's x-coordinate
10.    39 - 46  F8.3     Atom's y-coordinate
11.    47 - 54  F8.3     Atom's z-coordinate
12.    55 - 60  F6.2     Occupancy value for atom
13.    61 - 66  F6.2     B-value (thermal factor)
 -     67 - 67  1X       Blank
14.    68 - 70  I3       Footnote number
---------------------------------------------------------------------------
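Because the fields are defined by column position, an ATOM record should be read by slicing rather than by splitting on blanks. Splitting fails when neighbouring fields touch, as in the CG atom of the structure shown earlier, where the occupancy and B-value run together as `1.00151.86`. The sketch below follows the column ranges in the table (converted to Python's 0-based slices); the function name `parse_atom_record` and the dictionary keys are hypothetical.

```python
def parse_atom_record(line):
    """Read one ATOM/HETATM record using the fixed column ranges."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return {
        "record": line[0:6].strip(),     # columns  1 -  6
        "serial": int(line[6:11]),       # columns  7 - 11
        "name": line[12:16].strip(),     # columns 13 - 16
        "alt_loc": line[16].strip(),     # column  17
        "res_name": line[17:20].strip(), # columns 18 - 20
        "chain": line[21].strip(),       # column  22
        "res_seq": int(line[22:26]),     # columns 23 - 26
        "i_code": line[26].strip(),      # column  27
        "x": float(line[30:38]),         # columns 31 - 38
        "y": float(line[38:46]),         # columns 39 - 46
        "z": float(line[46:54]),         # columns 47 - 54
        "occupancy": float(line[54:60]), # columns 55 - 60
        "b_factor": float(line[60:66]),  # columns 61 - 66
    }


# The CG atom from the slide, assembled field by field so the widths are explicit.
sample = (
    "ATOM  " "    6" " " " CG " " " "LYS" " " "A" "   6" " " "   "
    "  65.999" "  22.092" "  -8.457" "  1.00" "151.86"
)
atom = parse_atom_record(sample)
```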
23Bibliographic databases
24Entrez
- pronounced áhn-trey
- from the French, meaning "enter"
25Searching with Entrez
- Enter Surname Initials (no commas)
- Enter Surname1 Initials1 Surname2 Initials2 (no
commas)
- Can search full names, e.g. Robert Simpson
- If searching for ambiguous names, e.g. Ryan
James, enter james, ryan
- Click Go
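The same author search can also be expressed as a query string. The sketch below only builds the request URL and does not contact the server. It assumes the NCBI E-utilities `esearch.fcgi` endpoint with its `db`, `term` and `retmax` parameters and the `[Author]` field tag, none of which appear on the slides; the helper names `author_term` and `esearch_url` are hypothetical. The author names are taken from the GenBank entry shown earlier.

```python
from urllib.parse import urlencode

# Assumption: public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author_term(*authors):
    """Combine 'Surname Initials' strings into one author query."""
    return " AND ".join(f"{name}[Author]" for name in authors)


def esearch_url(term, db="pubmed", retmax=20):
    """Build (but do not send) a search request URL."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = author_term("Anderson S", "Kingston IB")
url = esearch_url(term)
```

As in the web form, each author is written as surname followed by initials, with no comma.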
26Search Results
- Authors
- Title
- Journal
- Volume
- Page numbers
- Link
27Choosing what to display
28MEDLINE database flat file format
29Choosing file destination
- Import into bibliographic reference database
30National Bioinformatics Network