BIOINFORMATICS Databases - PowerPoint PPT Presentation

About This Presentation
Title:

BIOINFORMATICS Databases

Description:

Selection as Array Lookup. Same for a fold identifier from ... Name City Area-Code Phone-Number. Charles NY 212 345-6789. Mark SF 415 236-8982. Jane NY 212 &nbs ... – PowerPoint PPT presentation

Number of Views:10
Avg rating:3.0/5.0
Slides: 36
Provided by: off661
Category:

less

Transcript and Presenter's Notes

Title: BIOINFORMATICS Databases


1
BIOINFORMATICSDatabases
  • Mark Gerstein, Yale University
  • bioinfo.mbb.yale.edu/mbb452a

2
Start of class 2005,04.11 (Bioinfo-9)
3
Contents Databases
  • Structuring Information in Tables
  • Keys and Joins
  • Normalization
  • Complex RDB encoding
  • Indexes and Optimization
  • Forms and Reports

4
UnstructuredData
This type of membership analysis has been
performed previously in terms of the occurrence
of sequence motifs, families, functions, and
biochemical pathways. Starting from the most
basic units, genomes have been compared in terms
of the relative frequencies of short
oligonucleotide and oligopeptide words
(Blaisdell et al., 1996 Karlin Burge, 1995
Karlin et al., 1992 Karlin et al., 1996). The
degree of gene duplication in a number of genomes
has been ascertained (Brenner et al., 1995
Koonin et al., 1996b Riley Labedan, 1997
Wolfe Shields, 1997 Gerstein, 1997 Tamames et
al., 1997). Other analyses have looked at how
many highly conserved sequence families in one
organism are present in another (Green et al.,
1993 Koonin et al., 1995 Tatusov et al., 1997
Ouzounis et al., 1995a,b Clayton et al., 1997).
Finally, if sequences can be related to specific
functions and pathways, one can see whether
homologous sequences in two organisms truly have
the same role (ortholog vs. paralog) and whether
particular pathways are present or absent in
different organisms (Karp et al., 1996a Karp et
al., 1996b Koonin et al., 1996a Mushegian
Koonin, 1996 Tatusov et al., 1996, 1997). This
work has yielded many interesting conclusions in
terms of pathways that are modified or absent in
certain organisms. For instance, the essential
citric acid cycle is found to be highly modified
in H. influenzae (Fleischmann et al., 1995
Tatusov et al., 1996). Furthermore, identifying
pathways and proteins unique to certain microbes
may prove useful for developing drugs (e.g.
antibiotics against bacteria, Tatusov et al.,
1997). In some genome annotation systems,
attempts have been made to integrate a variety of
membership analyses and perform them on a large
scale in a highly automated fashion (Bork et al.,
1992a Bork et al., 1992b Scharf et al., 1994
Casari et al., 1995 Ouzounis et al., 1995a
Gaasterland Sensen, 1996).
5
Semi-Structured Data
  • REMARK 8 HET GROUP TRIVIAL NAME FLAVIN ADENINE
    DINUCLEOTIDE (FAD) 1FNB 79
  • REMARK 8 CAS REGISTRY NUMBER 146-14-5
    1FNB 80
  • REMARK 8 SEQUENCE NUMBER 315
    1FNB 81
  • REMARK 8 NUMBER OF ATOMS IN GROUP 53
    1FNB 82
  • REMARK 8
    1FNB 83
  • REMARK 8 HET GROUP TRIVIAL NAME PHOSPHATE
    1FNB 84
  • REMARK 8 SEQUENCE NUMBER 316
    1FNB 85
  • REMARK 8 NUMBER OF ATOMS IN GROUP 5
    1FNB 86
  • REMARK 8
    1FNB 87
  • REMARK 8 HET GROUP TRIVIAL NAME SULFATE
    1FNB 88
  • REMARK 8 SEQUENCE NUMBER 317
    1FNB 89
  • REMARK 8 NUMBER OF ATOMS IN GROUP 5
    1FNB 90
  • REMARK 8
    1FNB 91
  • REMARK 8 HET GROUP TRIVIAL NAME K2 PT(CN)4
    1FNB 92
  • REMARK 8 CHARGE 2- ( PT(CN)4 -- )
    1FNB 93
  • REMARK 8 SEQUENCE NUMBER PT1 - PT7
    1FNB 94
  • REMARK 8 NUMBER OF ATOMS IN GROUP 9
    1FNB 95
  • REMARK 8 ADDITIONAL COMMENTS BINDING SITES
    USED IN MIR PHASING 1FNB 96
  • REMARK 8
    1FNB 97

6
Structured Data
did_ fids d2rs51_ 1.002.007 d1imr__
1.010.002 d1pyib1 1.007.030 d1dxtd_
1.001.001 d181l__ 1.004.002 d1vmoa_
1.002.044 d2gsq_1 1.001.031 d1etb2_
1.002.003 d1guha1 1.001.031 d1hrc__
1.001.003 d150lc_ 1.004.002 d1dmf__
1.007.035 d1l19__ 1.004.002 d1yrnc_
1.010.002 d1apld_ 1.001.004 d1ndab2
1.003.004 d2rmai_ 1.002.036
gid_ TrgStrt TrgStop did HI0299 119
135 d193l__ HI0572 180 240 d1aba__
HI0989 56 125 d1aco_1 HI0988 106
458 d1aco_2 HI0154 2 76 d1acp__
HI1633 2 432 d1adea_ HI0349 1
183 d1aky__ HI1309 35 52 d1alo_3
HI0589 8 25 d1alo_3 HI1358 239
444 d1amg_2 HI1358 218 410 d1amy_2
HI0460 20 24 d1ans__ HI1386 139
147 d1ans__ HI0421 11 14 d1ans__
HI0361 285 295 d1ans__ HI0835 100
106 d1ans__
fid_ bestrep N_minsp N_scop
objname 1.001.001 d1flp__ 8 340
Globin-like 1.001.002 d1hdj__ 4 33
Long alpha-hairpin 1.001.003 d1ctj__ 9
78 Cytochrome c 1.001.004 d1enh__
18 76 DNA-binding 3-helical
bundle 1.001.005 d1dtr_2 1 3
Diphtheria toxin repressor (DtxR)
dimeriz 1.001.006 d1tns__ 1 2
Mu transposase, DNA-binding domain 1.001.007
d2spca_ 1 2 Spectrin repeat
unit 1.001.008 d1bdd__ 1 4
Immunoglobulin-binding protein A
modules 1.001.009 d1bal__ 1 5
Peripheral subunit-binding domain of
2-ox 1.001.010 d2erl__ 3 5
Protozoan pheromone proteins
7
Relational Databases
  • Databases make program data persistent
  • RDBs turn formless data in a number of
    structured tables
  • Ways of joining together tables to give various
    views of the data

8
SQL
Core
  • SIMPLE Language for Building and Querying Tables
  • CREATE a table
  • INSERT values into it
  • SELECT various entries from it (tuples, rows)
  • UPDATE the values
  • Example How Many Globin Foldsare there in E.
    coli versus Yeast?
  • Structures cx genomes, structures are arranged
    into folds, names of the folds (e.g. globins)

9
matches table
  • create table
  • matches(gid char255,
  • Genome_ID
  • TrgStrt int,
  • Start of
  • Match in Gene TrgStop int,
  • End of Match
  • in Gene did char255,
  • ID Matching
  • Structure score real
  • e-value
  • of Match
  • )

gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI1633 2 432 d1adea_ 0
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
10
matches table 2
  • insert into
  • matches
  • (gid, TrgStrt,
  • TrgStop, did,
  • score)
  • values
  • (HI0299, 119, 135, d193l__, 3.1)

gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI1633 2 432 d1adea_ 0
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
11
structures table
  • create table
  • structures(did char255,
  • ID Matching
  • Structure fid char255,
  • ID of fold that
  • structure has
  • )

did_ fid d2rs51_ 1.002.007 d1imr__
1.010.002 d1pyib1 1.007.030 d1dxtd_
1.001.001 d181l__ 1.004.002 d1vmoa_
1.002.044 d2gsq_1 1.001.031 d1etb2_
1.002.003 d1guha1 1.001.031 d1hrc__
1.001.003 d150lc_ 1.004.002 d1dmf__
1.007.035 d1l19__ 1.004.002 d1yrnc_
1.010.002 d1apld_ 1.001.004 d1ndab2
1.003.004 d2rmai_ 1.002.036
10 K domain structure IDs (did) vs. 300 fold IDs
(fid)
12
folds table
  • create table
  • folds(fid char255,
  • fold ID
  • bestrep char255,
  • N_hlx int,
  • N_beta int,
  • number of helices sheets
  • name char255
  • name of fold
  • )

fid_ bestrep N_hlx N_beta
name 1.001.001 d1flp__ 8 0
Globin-like 1.001.002 d1hdj__ 4 0
Long alpha-hairpin 1.001.003 d1ctj__ 9
0 Cytochrome c 1.001.004 d1enh__ 2
0 DNA-binding 3-helical
bundle 1.001.005 d1dtr_2 1 3
Diphtheria toxin repressor (DtxR)
dimeriz 1.001.006 d1tns__ 1 2
Mu transposase, DNA-binding domain 1.001.007
d2spca_ 0 2 Spectrin repeat
unit 1.001.008 d1bdd__ 0 4
Immunoglobulin-binding protein A
modules 1.001.009 d1bal__ 0 5
Peripheral subunit-binding domain of
2-ox 1.001.010 d2erl__ 3 5
Protozoan pheromone proteins
13
The problem databased in the example here
Matching structures to a genome
14
Table Interpretation
Match Table Ways Structures A, B, and C can
match HI Genome
Structures have a limited number of folds, which
have various characteristics
15
Structure of a Table
  • Row
  • Entity, Tuple, Instance
  • Column
  • Field
  • Attribute of an Entity
  • dimension
  • Key
  • Certain Attributes (or combination of attributes)
    can uniquely identify an object, these are keys
  • NULL
  • Variant Records

Core
16
What is a Key?
  • table matches(gid, TrgStrt, TrgStop, did, score)
  • table structures(did, fid)
  • table folds(fid, bestrep, N_hlx, N_beta, name)
  • gid -gt many matches
  • gid,TrgStrt -gt unique match (one tuple)
  • thus, primary key gid,TrgStrt
  • gid,TrgStop -gt unique match as well
  • fid -gt many dids, but did -gt one fid
  • thus, primary key did
  • one-to-one between fid and name

Core
1lt-gt11-gtmanymany-gt1
17
SQL Select on a Single Table
  • Select columns from a table where
    row-selection is true
  • projection of a selection
  • Sort result on a attribute

18
SQL Select on a Single Table, Example
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI0016 1 173 d1dar_2 2e-07
HI0016 179 274 d1dar_1 8.5e-06
HI0016 399 476 d1dar_4 0.00031
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
  • Select from matches where gid HI0016
  • HI0016 1 173 d1dar_2 2e-07
  • HI0016 179 274 d1dar_1 8.5e-06
  • HI0016 399 476 d1dar_4 0.00031
  • Select from matches where gid HI0016 and
    TrgStrt179
  • HI0016 179 274 d1dar_1 8.5e-06

19
SQL Select on a Single Table, Example 2
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI0016 1 173 d1dar_2 2e-07
HI0016 179 274 d1dar_1 8.5e-06
HI0016 399 476 d1dar_4 0.00031
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
  • Select did from matches where score lt 0.0001
  • d1aky__, d1dar_2, d1dar_1
  • HI0349 1 183 d1aky__ 7.6e-36
  • I0016 1 173 d1dar_2 2e-07
  • HI0016 179 274 d1dar_1 8.5e-06

20
Joins
Matches
Structures
did_ fid d2rs51_ 1.002.007 d1imr__
1.010.002 d1pyib1 1.007.030 d1dxtd_
1.001.001 d181l__ 1.004.002 d1vmoa_
1.002.044 d2gsq_1 1.001.031 d1etb2_
1.002.003 d1guha1 1.001.031 d1hrc__
1.001.003 d150lc_ 1.004.002 d1dmf__
1.007.035 d1l19__ 1.004.002 d1yrnc_
1.010.002 d1ans__ 1.007.008 d2rmai_ 1.002.036
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI1633 2 432 d1adea_ 0
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
Core
Foreign Key
Folds
fid_ bestrep N_hlx N_beta
name 1.001.001 d1flp__ 8 0
Globin-like 1.001.002 d1hdj__ 4 0
Long alpha-hairpin 1.001.003 d1ctj__ 9
0 Cytochrome c 1.001.004 d1enh__ 2
0 DNA-binding 3-helical
bundle 1.001.005 d1dtr_2 1 3
Diphtheria toxin repressor (DtxR)
dimeriz 1.001.006 d1tns__ 1 2
Mu transposase, DNA-binding domain 1.001.007
d2spca_ 0 2 Spectrin repeat
unit 1.001.008 d1bdd__ 0 4
Immunoglobulin-binding protein A
modules 1.007.008 d1qkt__ 4 3
Neurotoxin III (ATX III) 1.001.010 d2erl__
3 5 Protozoan pheromone proteins
21
SQL Select on Multiple Tables
  • Select from matches, structures,
    foldswherematches.gid HI0361and
    matches.didstructures.didand structures.fid
    folds.fid
  • Returnsmatches structures
    folds HI0361,285,295,d1ans__ ,8.2
    d1ans__,1.007.008 1.007.008,d1qkt__,4,
    3,Neurotoxin III ...
  • Select score,name from matches, structures, folds
    where gid HI0361and matches.didstructures.dida
    nd structures.fid folds.fid 8.2, Neurotoxin
    III ...

22
Foreign Key
structures
matches
did_ fid d2rs51_ 1.002.007 d1imr__
1.010.002 d1pyib1 1.007.030 d1dxtd_
1.001.001 d181l__ 1.004.002 d1vmoa_
1.002.044 d2gsq_1 1.001.031 d1etb2_
1.002.003 d1guha1 1.001.031 d1hrc__
1.001.003 d150lc_ 1.004.002 d1dmf__
1.007.035 d1l19__ 1.004.002 d1yrnc_
1.010.002 d1ans__ 1.007.008 d2rmai_ 1.002.036
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI1633 2 432 d1adea_ 0
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
matches.did is a (foreign) key in the structures
table -- i.e. looks up exactly one structure.
23
Selection as Array Lookup
  • Same for a fold identifier from a structure id
  • fidstructuredid
  • (perl pseudo-code)
  • Same for matches and folds tables, but this time
    arrays return multiple values and have multiple
    field keys
  • (bestrep, N_hlx, N_beta, name) foldsfid
  • (TrgStop,did,score)matchgid,TrgStrt
  • Joining as a double-lookup
  • did 1mbd__(bestrep, N_hlx, N_beta, name)
    folds structuresdid
  • Select bestrep,N_hlx,N_beta,name from structures,
    folds where structures.fid folds.fid and
    structures.did 1mbd__

24
Joins
Extra
did_ fid d2rs51_ 1.002.007 d1imr__
1.010.002 d1pyib1 1.007.030 d1dxtd_
1.001.001 d181l__ 1.004.002 d1vmoa_
1.002.044 d2gsq_1 1.001.031 d1etb2_
1.002.003 d1guha1 1.001.031 d1hrc__
1.001.003 d150lc_ 1.004.002 d1dmf__
1.007.035 d1l19__ 1.004.002 d1yrnc_
1.010.002 d1ans__ 1.007.008 d2rmai_ 1.002.036
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI1633 2 432 d1adea_ 0
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
fid_ bestrep N_hlx N_beta
name 1.001.001 d1flp__ 8 0
Globin-like 1.001.002 d1hdj__ 4 0
Long alpha-hairpin 1.001.003 d1ctj__ 9
0 Cytochrome c 1.001.004 d1enh__ 2
0 DNA-binding 3-helical
bundle 1.001.005 d1dtr_2 1 3
Diphtheria toxin repressor (DtxR)
dimeriz 1.001.006 d1tns__ 1 2
Mu transposase, DNA-binding domain 1.001.007
d2spca_ 0 2 Spectrin repeat
unit 1.001.008 d1bdd__ 0 4
Immunoglobulin-binding protein A
modules 1.007.008 d1qkt__ 4 3
Neurotoxin III (ATX III) 1.001.010 d2erl__
3 5 Protozoan pheromone proteins
25
Join Gives Unnormalized Table
Joining Two or More Tables with a Select Query
Gives a New, Bigger Table
Core
gid_ TrgStrt TrgStop did score fid
N_hlx N_beta name HI0299 119 135 d193l__
3.1 1.010.002 0 2 Spectrin repeat
unit HI0572 180 240 d1aba__ 0.0032
1.002.045 1 2 Mu transposase, DNA-binding
domain HI0989 56 125 d1aco_1 0.0049
1.001.031 8 0 Globin-like HI0988 106
458 d1aco_2 4.4e-14 1.001.031 8 0
Globin-like HI0154 2 76 d1acp__ 1.2e-23
1.001.031 8 0 Globin-like HI1633 2
432 d1adea_ 0 1.010.002 0 2
Spectrin repeat unit HI0349 1 183
d1aky__ 7.6e-36 1.001.031 8 0
Globin-like HI1309 35 52 d1alo_3 1.1
1.007.008 4 3 Neurotoxin III (ATX
III) HI0589 8 25 d1alo_3 1.8
1.002.045 1 2 Mu transposase, DNA-binding
domain HI1358 239 444 d1amg_2 0.002
1.004.002 1 3 Diphtheria toxin repressor
(DtxR) HI1358 218 410 d1amy_2 0.00037
1.002.044 0 4 Immunoglobulin-binding
protein A HI0460 20 24 d1ans__ 1.8
1.007.008 4 3 Neurotoxin III (ATX
III) HI1386 139 147 d1ans__ 3.3
1.007.008 4 3 Neurotoxin III (ATX
III) HI0421 11 14 d1ans__ 6.4
1.007.008 4 3 Neurotoxin III (ATX
III) HI0361 285 295 d1ans__ 8.2
1.007.008 4 3 Neurotoxin III (ATX
III) HI0835 100 106 d1ans__ 9.7
1.007.008 4 3 Neurotoxin III (ATX III)
26
Normalization
  • What if Want to update Fold 1.007.008 to be
    Neurotoxin IV?
  • Many Updates
  • So Good if Previously Normalized into Separate
    Tables
  • Eliminate Redundancy
  • Allow Consistent Updating

Core
gid_ TrgStrt TrgStop did score fid
N_hlx N_beta name HI0299 119 135 d193l__
3.1 1.010.002 0 2 Spectrin repeat
unit HI0572 180 240 d1aba__ 0.0032
1.002.045 1 2 Mu transposase, DNA-binding
domain HI0989 56 125 d1aco_1 0.0049
1.001.031 8 0 Globin-like HI0988 106
458 d1aco_2 4.4e-14 1.001.031 8 0
Globin-like HI0154 2 76 d1acp__ 1.2e-23
1.001.031 8 0 Globin-like HI1633 2
432 d1adea_ 0 1.010.002 0 2
Spectrin repeat unit HI0349 1 183
d1aky__ 7.6e-36 1.001.031 8 0
Globin-like HI1309 35 52 d1alo_3 1.1
1.007.008 4 3 Neurotoxin III (ATX
III) HI0589 8 25 d1alo_3 1.8
1.002.045 1 2 Mu transposase, DNA-binding
domain HI1358 239 444 d1amg_2 0.002
1.004.002 1 3 Diphtheria toxin repressor
(DtxR) HI1358 218 410 d1amy_2 0.00037
1.002.044 0 4 Immunoglobulin-binding
protein A HI0460 20 24 d1ans__ 1.8
1.007.008 4 3 Neurotoxin III (ATX
III) HI1386 139 147 d1ans__ 3.3
1.007.008 4 3 Neurotoxin III (ATX
III) HI0421 11 14 d1ans__ 6.4
1.007.008 4 3 Neurotoxin III (ATX
III) HI0361 285 295 d1ans__ 8.2
1.007.008 4 3 Neurotoxin III (ATX
III) HI0835 100 106 d1ans__ 9.7
1.007.008 4 3 Neurotoxin III (ATX III)
27
Normalization Example
Un-normalized
Normalized
Name City Phone-Number Charles NY
345-6789 Mark SF 236-8982
Jane NY 567-2345 Jeff SF
435-3535 Jack Boston 234-9988
Name City Area-Code Phone-Number Charles
NY 212 345-6789 Mark SF
415 236-8982 Jane NY 212
567-2345 Jeff SF 415
435-3535 Jack Boston 617 234-9988

City Area-Code NY 212 SF
415 Boston 617
Core
28
Normalized Tables
did_ fid d2rs51_ 1.002.007 d1imr__
1.010.002 d1pyib1 1.007.030 d1dxtd_
1.001.001 d181l__ 1.004.002 d1vmoa_
1.002.044 d2gsq_1 1.001.031 d1etb2_
1.002.003 d1guha1 1.001.031 d1hrc__
1.001.003 d150lc_ 1.004.002 d1dmf__
1.007.035 d1l19__ 1.004.002 d1yrnc_
1.010.002 d1ans__ 1.007.008 d2rmai_ 1.002.036
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI1633 2 432 d1adea_ 0
HI0349 1 183 d1aky__ 7.6e-36
HI1309 35 52 d1alo_3 1.1
HI0589 8 25 d1alo_3 1.8
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI0460 20 24 d1ans__ 1.8
HI1386 139 147 d1ans__ 3.3
HI0421 11 14 d1ans__ 6.4
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
Theory of Normaliz-ation
fid_ bestrep N_hlx N_beta
name 1.001.001 d1flp__ 8 0
Globin-like 1.001.002 d1hdj__ 4 0
Long alpha-hairpin 1.001.003 d1ctj__ 9
0 Cytochrome c 1.001.004 d1enh__ 2
0 DNA-binding 3-helical
bundle 1.001.005 d1dtr_2 1 3
Diphtheria toxin repressor (DtxR)
dimeriz 1.001.006 d1tns__ 1 2
Mu transposase, DNA-binding domain 1.001.007
d2spca_ 0 2 Spectrin repeat
unit 1.001.008 d1bdd__ 0 4
Immunoglobulin-binding protein A
modules 1.007.008 d1qkt__ 4 3
Neurotoxin III (ATX III) 1.001.010 d2erl__
3 5 Protozoan pheromone proteins
29
Query Optimization
  • Get at the Data Quickly!!
  • Indexes
  • Hash Function Reproduce the Effect of Indexes
  • Rapidly Associate a Bucket with Each Key
  • Joining 10 tables, which to do first?
  • Joining is slow so store some tables in
    unnormalized form
  • Speed vs Memory

30
Indexes Speed Access
One Index
No Index
Double Index
31
Object Databases
Extra
C, fortran vs. C
32
Forms reports user views
  • Reports are the result of running a succession of
    selects queries on a database, joining together a
    number of tables, and then pasting the results
    together
  • Forms are the same but they are editable
  • Forms and Reports represent particular views of
    the data
  • For instance, one can be keyed on gene id listing
    all the structures matching a gene and the other
    could be keyed on structure id listing all the
    gene matching a given structure

Extra
33
Aspects of FormsTransactions and Security
  • Transactions
  • Genome Centers and United Airlines!
  • Log each entry and enable UNDO
  • Security
  • Only certain users can modify certain fields

34
Complex Data ExampleEncoding Trees in RDBs
Node Name1 Organism2 Bacteria3 Archea4 Eukarya
5 Metazoa6 Plants
Node Parent 1 02 13 14 15 46 4
1
3
2
4
5
6
35
Middle of class 2005,04.11 (Bioinfo-9)
Write a Comment
User Comments (0)
About PowerShow.com