Day 3. Databases

Slides: 31
Provided by: uov

1
Bioinformatics
  • Day 3. Databases
  • BLAST
  • Database locations
  • Nucleotide databases
  • Protein databases
  • Transfac database
  • Structure databases
  • Bibliographic

2
(No Transcript)
3
Multiple flavours of BLAST
4
Databases to search with BLAST
  • A truncated list of choices

5
The PSI-BLAST algorithm
  • Probe each sequence in the database for local
    regions of similarity to the query
  • Collect significant hits
  • Construct a multiple sequence alignment table
    between the query sequence and the significant
    local matches
  • Form a profile from the multiple alignment
  • Re-probe the database with the profile, looking
    only for local matches
  • Retain statistically significant hits
  • Return to the second step (collect significant
    hits) and repeat until the result does not
    change (iteration)
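
The iteration above can be illustrated with a deliberately small toy. This is a sketch under strong simplifying assumptions, not the real PSI-BLAST: matches are ungapped windows the length of the query, the first pass scores plain identity (+1 / -1), the profile is a column-wise log-odds table with a pseudocount and a uniform background, and "significant" means a score above a hand-picked cutoff rather than an E-value. All function names and cutoffs are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def best_window(score, width, seq):
    """Return (best score, start) over all ungapped windows of `width` in seq."""
    best = (float("-inf"), -1)
    for start in range(len(seq) - width + 1):
        total = sum(score(i, seq[start + i]) for i in range(width))
        best = max(best, (total, start))
    return best


def build_profile(windows, pseudocount=1.0):
    """Column-wise log-odds scores from equally long aligned windows."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for column in zip(*windows):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log((counts[aa] + pseudocount) / total / background)
            for aa in ALPHABET
        })
    return profile


def iterative_search(query, database, first_cutoff, profile_cutoff, max_rounds=10):
    """Probe, collect hits, build a profile, re-probe; stop when hits are stable."""
    width = len(query)
    # First pass: score each position by identity with the query.
    score = lambda i, aa: 1.0 if aa == query[i] else -1.0
    cutoff = first_cutoff
    hits = None
    for _ in range(max_rounds):
        new_hits = {}
        for name, seq in database.items():
            total, start = best_window(score, width, seq)
            if total >= cutoff:
                new_hits[name] = seq[start:start + width]
        if new_hits == hits:  # result did not change: stop iterating
            break
        hits = new_hits
        # Later passes: score each position with the profile built from the hits.
        profile = build_profile([query] + list(hits.values()))
        score = lambda i, aa, p=profile: p[i].get(aa, 0.0)
        cutoff = profile_cutoff
    return hits


# Made-up toy database, for illustration only.
toy_db = {
    "exact": "KKACDEFGKK",
    "close": "MMACDEYGMM",
    "far": "WWWWWWWWWW",
}
hits = iterative_search("ACDEFG", toy_db, first_cutoff=4, profile_cutoff=3.0)
print(sorted(hits))
```

The two cutoffs are separate because identity scores and log-odds scores live on different scales; a real implementation would use a statistical significance measure instead.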

6
Databases
  • Nucleic acid sequences, including whole-genome
    projects
  • Amino acid sequences of proteins
  • Protein and nucleic acid structures
  • Small-molecule crystal structures
  • Protein functions
  • Expression patterns of genes
  • Publications

7
Database locations
  • National Center for Biotechnology Information
    (NCBI)
  • European Bioinformatics Institute (EBI)
  • DNA Data Bank of Japan (DDBJ)

8
Nucleic acid sequence databases
9
Database entries
LOCUS       BTBPTIG                 3998 bp    DNA     linear   MAM 17-NOV-2004
DEFINITION  Bovine pancreatic trypsin inhibitor (BPTI) gene.
ACCESSION   X03365 K00966
VERSION     X03365.1  GI:142
KEYWORDS    Alu-like repetitive sequence; protease inhibitor; trypsin
            inhibitor.
SOURCE      Bos taurus (cattle)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Laurasiatheria;
            Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos.
REFERENCE   1
  AUTHORS   Anderson,S. and Kingston,I.B.
  TITLE     Isolation of a genomic clone for bovine pancreatic trypsin
            inhibitor by using a unique-sequence synthetic DNA probe
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 80 (22), 6838-6842 (1983)
   PUBMED   6580617
REFERENCE   2  (bases 1 to 3998)
  AUTHORS   Kingston,I.B. and Anderson,S.
  TITLE     Sequences encoding two trypsin inhibitors occur in strikingly
            similar genomic environments
  JOURNAL   Biochem. J. 233 (2), 443-450 (1986)
   PUBMED   2420326
COMMENT     Data kindly reviewed (08-DEC-1987) by Kingston I.B.
FEATURES             Location/Qualifiers
     source          1..3998
                     /organism="Bos taurus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9913"
     misc_feature    795..800
                     /note="pot. polyA signal"
     CDS             2685
                     /note="unnamed protein product; trypsin inhibitor
                     (aa 1-58)"
                     /codon_start=1
                     /protein_id="CAA27063.1"
                     /db_xref="GI:1364184"
10
The sequence
ORIGIN
        1 aattctgata atgcagagaa ctggtaagga gttctgattg ttctgcttga ttaaatgggt
       61 tgtaacagga tagtgtcttg tcctgatcct agcattcata tggtgtgtgt tctggggcaa
      121 gtcatctgca gtttcttcac ctgaacaggg ggaccaggtt acatgagttt
          [sequence deleted]
          actttggggt gtgttatttc cctgaatt
//
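
As a rough illustration of how the ORIGIN block relates to the FASTA download on the next slide, the sketch below strips the position numbers and spaces from an ORIGIN block and rewraps the bases as FASTA. It assumes the usual GenBank layout (an `ORIGIN` line, numbered lines of lower-case bases, a closing `//`); the function names and the 70-column default are illustrative choices, not part of any library.

```python
def origin_to_sequence(origin_block):
    """Join the bases of a GenBank ORIGIN block, dropping numbers and spaces."""
    bases = []
    for line in origin_block.splitlines():
        line = line.strip()
        if not line or line.startswith("ORIGIN") or line == "//":
            continue
        # Keep only the purely alphabetic tokens; the leading token is a position.
        bases.extend(token for token in line.split() if token.isalpha())
    return "".join(bases).upper()


def to_fasta(description, sequence, width=70):
    """Format a description and sequence as FASTA text with wrapped lines."""
    lines = [">" + description]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# First two sequence lines of the BPTI entry shown on the slide.
origin = """ORIGIN
        1 aattctgata atgcagagaa ctggtaagga gttctgattg ttctgcttga ttaaatgggt
       61 tgtaacagga tagtgtcttg tcctgatcct agcattcata tggtgtgtgt tctggggcaa
//"""

sequence = origin_to_sequence(origin)
print(to_fasta("X03365.1 (first 120 bases)", sequence))
```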
11
Download sequence in FASTA format
>gi|142|emb|X03365.1|BTBPTIG Bovine pancreatic trypsin inhibitor (BPTI) gene
AATTCTGATAATGCAGAGAACTGGTAAGGAGTTCTGATTGTTCTGCTTGATTAAATGGGTTGTAACAGGA
TAGTGTCTTGTCCTGATCCTAGCATTCATATGGTGTGTGTTCTGGGGCAAGTCATCTGCAGTTTCTTCAC
CTGAACAGGGGGACCAGGTTACATGAGTTTCTTAAAAGATTACCAGTCATGAGTATGAAGAGTTTACACT
TTCCTGATCAATGACGTCCATTTCCCATCAAAATATTTTAGTCCAAAAGACTCATCTATCTAATGTAGAT
CATTTTCTCACCACCCCTCTAAAAAATTTATCTTTCAGATATGATCATTTCTCTATTATGAAATTAATCA
GAGAGTTGAGTGACAGCTGAGTGTCTTCCCTCCAAAGGCAACTGCAGGAAGAGCAAGAAATGCAATACTT
TTCTATGAGTTTGCTCGTGGGGCCAAGACTGCTTTTTCCAGGCTGGTACAATAGTAATCAAATCTCAAAG
ATATTCTTCTTTCCTCCTGGCCAGACTATTATTTTATTTTCCTATCAAGATATAGAAAGTTAGAAGTAGA
CTCATAATTATATAGGCAGGCCTCATCATCAAATAGACTAACAAGAATTTTATTTTATCTGCCTTTTCAA
TGACTGTGCACTTGGCATGAGGATGAAATGGGAGATTTATTCCCTTGATAAATATTCATGAAATACTTAT
GCTTTTTGTCCCTAAAAAGCATATTTCTTGATATAGGAAAACAGCTGTAAACAAAAGGTAGTAAAATAAT
[sequence deleted]
GTAGAATTTCCATCATCGAGTTTTCAGCTCAGTGGTGGGAGAGGTCTTTTCATGAACGAAACCTCCTCCT
CACATTGATTTGAAGGTCTGTGGCTTCAAAGAGTCTGGCCTTATCTTTAAATAAATTCATATTTTAATTA
AACTAACTGGAGTGGATTGTGTTGTTTGCAACTAAGAACCTTAACCCATAGGTTCCATGGAAACGGTGGT
CTTTCTCATTTTATGCAGATGGGTGGGCAGCTCTCCATCACCTCTCCTCAGACTCAGCCCTACCAAGTAG
AAGGAGCCAACCCCTTACACTGACATCTACCTCTTATGGCCGTGCCAGTGTACATGAAAAACTGGATGAG
AGACACCTCAACAAGAAAACTTTTGTCCTTCACTTCTTGGGCCAGGTCAAACTTTGGGGTGTGTTATTTC
CCTGAATT
12
FASTA file format
  • A sequence in FASTA format begins with a
    single-line description, followed by lines of
    sequence data
  • The description line begins with the greater-than
    (">") symbol
  • Lines of text should be shorter than 80
    characters

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
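
The format rules above translate almost directly into a reader. The following is a minimal sketch, assuming the standard layout in which each description line starts with ">" and every following line up to the next ">" is sequence data; `parse_fasta` is an illustrative name, and the sample uses only the first two sequence lines of the slide's example plus a made-up second record.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


sample = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
>toy_record made-up second entry
ACDEFG
"""

records = parse_fasta(sample)
for description, sequence in records:
    print(description.split("|")[-1], len(sequence))
```

Splitting the description on "|" recovers the database identifiers (here `gi`, `pir`) for headers that follow that convention; headers without pipes are returned unchanged.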
13
Protein sequence databases
  • SWISS-PROT (Swiss Institute of Bioinformatics and
    EMBL)
  • PIR (Georgetown University)
  • MIPS (Munich Information Center for Protein
    Sequences)
  • Explore and see what they offer

14
SWISS-PROT
15
Protein Information Resource (PIR)
16
The Transfac database: transcription factors
17
The Transfac SITE file entry
AC  Accession no.
ID  Identifier
DT  Date; author
TY  Sequence type
DE  Description (gene or gene product); GENE accession no.
RE  Gene region (e.g. promoter, enhancer)
SQ  Sequence of the regulatory element
EL  Denomination of the element
SF  First position of factor binding site
ST  Last position of factor binding site
S1  Definition of first position (if not transcription start site)
BF  Binding factor (FACTOR accession no.; name; quality; biological species)
MX  Deduced matrix (MATRIX accession no.; identifier)
OS  Organism species
OC  Organism classification
SO  Factor source (TRANSFAC CELL accession no.; name)
MM  Method
CC  Comments
DR  External databases (EPD, Flybase, TRANSCompel, TRANSPRO, PathoDB)
DR  EMBL accession no.; identifier (first:last position of the TRANSFAC
    sequence element)
RX  MEDLINE ID
RN  Reference no.
RA  Reference authors
RT  Reference title
RL  Reference data
//

18
The FACTOR file entries
AC  Accession no.
ID  Identifier
DT  Date; author
FA  Factor name
SY  Synonyms
OS  Species
OC  Biological classification (taxonomy)
GE  Encoding gene
HO  Homologs (suggested)
CL  Classification (class accession no.; class identifier; decimal
    classification number)
SZ  Size (length (number of amino acids); calculated molecular mass in
    kDa; experimental molecular mass (or range) in kDa (experimental
    method); Ref)
SQ  Sequence
SC  Sequence comment, i.e. source of the protein sequence
FT  Feature table (1st position; last position; feature)
SF  Structural features
CP  Cell specificity (positive)
CN  Cell specificity (negative)
EX  Expression pattern (organ, cell name, system, developmental stage;
    relative level of expression (very high, high, medium, low, very
    low, detectable or none); detection method; molecule type detected,
    i.e. RNA or protein; reference)
FF  Functional features
IN  Interacting factors (factor accession no.; factor name; biological
    species)
MX  Matrix (MATRIX accession no.; identifier)
BS  Binding SITE accession no.; SITE ID; Quality N; short description,
    GENE accession no.; biological species
[deleted]
//
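
Both the SITE and FACTOR entries follow the same line-oriented pattern: a two-letter field code at the start of each line, repeated codes for repeated fields, and `//` closing the entry. A minimal reader for that pattern might look like the sketch below. It assumes the code occupies the first two characters of the line and that `XX` lines are spacers, as in EMBL-style flat files; the function name and the sample values are made up for illustration and are not real TRANSFAC records.

```python
from collections import defaultdict


def parse_flat_entries(text):
    """Split a two-letter-code flat file into entries: {code: [values, ...]}."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        code = line[:2]
        if not code.strip() or code == "XX":  # blank line or spacer line
            continue
        current[code].append(line[2:].strip())
    if current:  # tolerate a final entry without a closing marker
        entries.append(dict(current))
    return entries


# Placeholder values only; the field codes are the ones listed on the slide.
sample = """AC  R99999
ID  EXAMPLE$SITE
XX
SQ  TTGACGAG
BF  T99991; example factor
BF  T99992; another factor
//
AC  R99998
//
"""

entries = parse_flat_entries(sample)
print(entries[0]["AC"], len(entries[0]["BF"]))
```

Keeping every value in a list means repeated codes such as `BF` or `DR` need no special handling.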
19
Searching Transfac
Factors matched (bases 1-70): TBP, GCN4, GAL4, BAF1, AP-1, DBF-A
        10         20         30         40         50         60         70
TTCTCATGTT TGACGAGCTT ATCATCGATA AGCTTTAATG CGGTAGTTTA TCACAGTTAA ATTGCTAACG
Site positions listed: 4, 11, 19, 27, 48

Factors matched (bases 71-140): GAL4, AP-1, GCN4, GCR1, RAF, BAF1
        80         90        100        110        120        130        140
CAGTCAGGCA CCGTGTATGA AATCTAACAA TGCGCTCATC GTCATCCTCG GCACCGTCAC CCTGGATGCT
Site positions listed: 71, 73, 109, 111, 113, 115, 118, 125, 126, 134
20
PDB structure file
HEADER    HYDROLASE                               05-MAY-00   1EY0
TITLE     STRUCTURE OF WILD-TYPE S. NUCLEASE AT 1.6 A RESOLUTION
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: STAPHYLOCOCCAL NUCLEASE;
COMPND   3 CHAIN: A;
COMPND   4 EC: 3.1.31.1;
COMPND   5 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: STAPHYLOCOCCUS AUREUS;
SOURCE   3 ORGANISM_COMMON: BACTERIA;
SOURCE   4 STRAIN: FOGGI;
SOURCE   5 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE   6 EXPRESSION_SYSTEM_COMMON: BACTERIA
KEYWDS    HYDROLASE
EXPDTA    X-RAY DIFFRACTION
AUTHOR    J.CHEN,Z.LU,J.SAKON,W.E.STITES
REVDAT   1   18-OCT-00 1EY0    0
JRNL        AUTH   J.CHEN,Z.LU,J.SAKON,W.E.STITES
JRNL        TITL   INCREASING THE THERMOSTABILITY OF STAPHYLOCOCCAL
JRNL        TITL 2 NUCLEASE: IMPLICATIONS FOR THE ORIGIN OF PROTEIN
JRNL        TITL 3 THERMOSTABILITY
JRNL        REF    J.MOL.BIOL.                   V. 303   125 2000
21
PDB structure file
HELIX    1   1 TYR A   54  ASN A   68  1                                  15
HELIX    2   2 VAL A   99  GLN A  106  1                                   8
HELIX    3   3 HIS A  121  GLU A  135  1                                  15
HELIX    4   4 LEU A  137  SER A  141  5                                   5
SHEET    1   A 7 LYS A  97  MET A  98  0
SHEET    2   A 7 GLY A  88  ALA A  94 -1  N  ALA A  94   O  LYS A  97
SHEET    3   A 7 ILE A  72  PHE A  76 -1  O  GLU A  73   N  TYR A  93
SHEET    4   A 7 LYS A   9  ALA A  17 -1  O  GLU A  10   N  VAL A  74
SHEET    5   A 7 THR A  22  TYR A  27 -1  N  LYS A  24   O  LYS A  16
SHEET    6   A 7 GLN A  30  LEU A  36 -1  O  GLN A  30   N  TYR A  27
SHEET    7   A 7 GLY A  88  ALA A  94  1  O  GLY A  88   N  ARG A  35
SHEET    1   B 2 VAL A  39  ASP A  40  0
SHEET    2   B 2 LYS A 110  VAL A 111 -1  O  LYS A 110   N  ASP A  40
ATOM      1  N   LYS A   6      63.582  22.010  -9.585  1.00 67.45           N
ATOM      2  CA  LYS A   6      63.469  22.366  -8.175  1.00 98.38           C
ATOM      3  C   LYS A   6      62.720  23.683  -7.978  1.00 57.15           C
ATOM      4  O   LYS A   6      63.134  24.746  -8.431  1.00 61.37           O
ATOM      5  CB  LYS A   6      64.851  22.447  -7.522  1.00 60.18           C
ATOM      6  CG  LYS A   6      65.999  22.092  -8.457  1.00151.86           C
ATOM      7  CD  LYS A   6      66.725  20.834  -8.012  1.00157.91           C
ATOM      8  CE  LYS A   6      68.170  20.803  -8.483  1.00156.93           C
ATOM      9  NZ  LYS A   6      68.991  19.821  -7.717  1.00 62.55           N
END

22
PDB file field codes
ATOM      1  N   LYS A   6      63.582  22.010  -9.585  1.00 67.45           N

---------------------------------------------------------------------------
Field  Column   FORTRAN
No.    range    format   Description
---------------------------------------------------------------------------
 1.     1 -  6  A6       Record ID (e.g. ATOM, HETATM)
 2.     7 - 11  I5       Atom serial number
 -     12 - 12  1X       Blank
 3.    13 - 16  A4       Atom name (e.g. " CA ", " ND1")
 4.    17 - 17  A1       Alternative location code (if any)
 5.    18 - 20  A3       Standard 3-letter amino acid code for residue
 -     21 - 21  1X       Blank
 6.    22 - 22  A1       Chain identifier code
 7.    23 - 26  I4       Residue sequence number
 8.    27 - 27  A1       Insertion code (if any)
 -     28 - 30  3X       Blank
 9.    31 - 38  F8.3     Atom's x-coordinate
10.    39 - 46  F8.3     Atom's y-coordinate
11.    47 - 54  F8.3     Atom's z-coordinate
12.    55 - 60  F6.2     Occupancy value for atom
13.    61 - 66  F6.2     B-value (thermal factor)
 -     67 - 67  1X       Blank
14.    68 - 70  I3       Footnote number
---------------------------------------------------------------------------
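
Because the ATOM record is a fixed-column format, it is read by slicing at the column ranges in the table rather than by splitting on whitespace. The sketch below maps the 1-based column ranges to 0-based Python slices; the `Atom` type and `parse_atom` function are illustrative names, and the sample line is atom 6 from the previous slide, laid out on the assumption that it follows the columns in the table.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    record: str
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line):
    """Parse one ATOM/HETATM line using the fixed column ranges."""
    return Atom(
        record=line[0:6].strip(),        # columns  1 -  6
        serial=int(line[6:11]),          # columns  7 - 11
        name=line[12:16].strip(),        # columns 13 - 16
        alt_loc=line[16].strip(),        # column  17
        res_name=line[17:20].strip(),    # columns 18 - 20
        chain=line[21],                  # column  22
        res_seq=int(line[22:26]),        # columns 23 - 26
        x=float(line[30:38]),            # columns 31 - 38
        y=float(line[38:46]),            # columns 39 - 46
        z=float(line[46:54]),            # columns 47 - 54
        occupancy=float(line[54:60]),    # columns 55 - 60
        b_factor=float(line[60:66]),     # columns 61 - 66
    )


sample = "ATOM      6  CG  LYS A   6      65.999  22.092  -8.457  1.00151.86           C"
atom = parse_atom(sample)
print(atom.name, atom.res_name, atom.res_seq, atom.occupancy, atom.b_factor)
```

This record shows why slicing matters: the occupancy and B-value fields touch (`1.00151.86`), so a whitespace split would merge them into one token.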
23
Bibliographic databases
24
Entrez
  • pronounced áhn-trey
  • from the French, meaning "enter"

25
Searching with Entrez
  • Enter Surname Initials (no commas)
  • Enter Surname1 Initials1 Surname2 Initials2 (no
    commas)
  • Can search full names, e.g. Robert Simpson
  • If searching for ambiguous names, e.g. Ryan
    James, enter james, ryan
  • Click Go
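
The same author searches can also be expressed as a query string. The sketch below is an assumption-laden illustration rather than part of the slides: it formats "Surname Initials" pairs with PubMed's `[Author]` field tag and builds a URL for NCBI's E-utilities `esearch` endpoint without sending any request. The helper names and the `retmax` default are hypothetical choices.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (the web form on the slides is a separate interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author_term(authors):
    """Build a search term from (surname, initials) pairs, without commas."""
    return " AND ".join(
        f"{surname} {initials}[Author]" for surname, initials in authors
    )


def esearch_url(term, db="pubmed", retmax=20):
    """Return the URL for a search; nothing is fetched here."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = author_term([("Anderson", "S"), ("Kingston", "IB")])
url = esearch_url(term)
print(term)
print(url)
```

Combining two authors with `AND` mirrors entering both names in the search box; whether the service treats every name form the same way as the web form is not something this sketch checks.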

26
Search Results
  • Authors
  • Title
  • Journal
  • Volume
  • Page numbers
  • Link

27
Choosing what to display
28
MEDLINE database flat file format
29
Choosing file destination
  • Import into bibliographic reference database

30
National Bioinformatics Network
  • National courses