Title: Introduction to Databases
1Introduction to Databases
2Structure
- Motivation
- Introduction to MySQL
- Example Queries
- Using SQL to query SCOP
3Motivation
- In the last term,
- we accessed most information online via the web
- we interacted directly and manually with
databases and tools - we had to manually submit queries, interpret
results. select interesting results, cutpaste
them, and submit queries again, - Pro
- Reasonably easy to get hold of information
- Con
- Not possible to ask many queries
- Queries limited by interface provided by web page
- Difficult/impossible to integrate information
from different sites - In this term, we will look at the databases
underlying the online front ends - How is the data internally stored?
- How can we - and more important computer programs
- directly interact with the underlying data, so
that we can ask more powerful queries, large
queries, and integrate different systems
4What actually happens when you retrieve data
online?
5What actually happens
- You are limited by what web server allows you to
ask - Example CATH
- PDB ID,
- CATH code, or
- General text
- But you cannot ask
- In how many different PDB structures is there a
P-loop domain? - Is there a PDB entry with a P-loop and a
DNA-binding domain - How many different superfamilies does the largest
structure in PDB have? - With direct access to the underlying database you
could answer all these questions (and many more)
6Querying over the Web
- Problem is always the same
- The web interface limits access to the underlying
database - How can we interact directly with the database
7What databases are about
- Logical organization of data
- data models, schema design, dictionaries
- Physical organization of data
- Fast retrieval, indexing, compact storage of data
- Other requirements
- Logging (important to know who did what to the
data) - Security and access control (important to know
who can do what) - Transactions and concurrency control (important
when more than one person is working on database) - Integrity (important to ensure that only valid
entries in the database) - Recovery (important as hardware and software can
sometimes fail
8Different types of databases
- Flat files
- XML
- Relational database
- (Object databases)
- (Object relational databases)
9Flat files
ID BTBPTIG standard genomic DNA MAM 3998
BP. XX AC X03365 K00966 XX SV
X03365.1 XX DT 18-NOV-1986 (Rel. 10,
Created) DT 20-MAY-1992 (Rel. 31, Last updated,
Version 3) XX DE Bovine pancreatic trypsin
inhibitor (BPTI) gene XX KW Alu-like repetitive
sequence protease inhibitor trypsin
inhibitor. XX OS Bos taurus (cow) OC
Eukaryota Metazoa Chordata Craniata
Vertebrata Euteleostomi Mammalia OC
Eutheria Cetartiodactyla Ruminantia Pecora
Bovoidea Bovidae Bovinae OC Bos. XX RN
1 RP 1-3998 RX MEDLINE 86158754. RX
PUBMED 2420326. RA Kingston I.B., Anderson
S. RT "Sequences encoding two trypsin
inhibitors occur in strikingly similar RT
genomic environments" RL Biochem. J.
233(2)443-450(1986). XX RN 2 RX MEDLINE
84070725. RX PUBMED 6580617. RA Anderson S.,
Kingston I.B. RT "Isolation of a genomic clone
for bovine pancreatic trypsin inhibitor by RT
using a unique-sequence synthetic DNA probe."
- We can store any data in a flat file, e.g. EMBL
- But is this a database?
- Logical data organisation None, unless we define
one (as done for EMBL) and adhere to it, which is
not enforced - Physical data organisation None, we cannot
optimise retrieval for common queries - Logging No
- Access control Implicit through Unix
- Transaction and concurrency control None
- Integrity None
- Recovery If files are backed-up they can be
recovered. However, not on the fly
10XML files
ltArticlegt ltJournalgt ltISSNgt0270-7306lt/ISSNgt ltJourn
alIssuegt ltVolumegt19lt/Volumegt ltIssuegt11lt/Issuegt
ltPubDategt ltYeargt1999lt/Yeargt ltMonthgtNovlt/M
onthgt lt/PubDategt lt/JournalIssuegt lt/Journalgt ltAr
ticleTitlegtDifferential regulation of the cell
wall integrity mitogen-activated protein kinase
pathway in budding yeast by the protein tyrosine
phosphatases Ptp2 and Ptp3. lt/ArticleTitlegt ltPagin
ationgt ltMedlinePgngt7651-60lt/MedlinePgngt lt/Paginat
iongt ltAbstractgt ltAbstractTextgtMitogen-activated
protein kinases (MAPKs) are inactivated by
dual-specificity and protein tyrosine
phosphatases (PTPs) in yeasts. In Saccharomyces
cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate
the MAPKs, Hog1 and Fus3, with different
specificities... lt/AbstractTextgt lt/Abstractgt ltAffi
liationgtDepartment of Chemistry, University of
Colorado, Boulder, Colorado 80309-0215,
USA. lt/Affiliationgt
- We can store any data in XML, the eXtentable
Mark-up Language, e.g. Medline - But is this a database?
- Logical data organisation yes, XML schema, which
is enforced - Physical data organisation None, we cannot
optimise retrieval for common queries - Logging No
- Access control Implicit through Unix
- Transaction and concurrency control None
- Integrity None
- Recovery If files are backed-up they can be
recovered. However, not on the fly
11Relational Database
- Central Idea Data as relations in a table
- E.g. SCOP, Structural Classification of Proteins
---------------------------------------------
------------------------ id type sccs
sid description
-------------------------------------------
-------------------------- 46457 cf a.1
- Globin-like
46458 sf a.1.1 -
Globin-like 46459
fa a.1.1.1 - Truncated hemoglobin
46460 dm a.1.1.1 -
Truncated hemoglobin 46461
sp a.1.1.1 - Ciliate (Paramecium
caudatum) 14982 px a.1.1.1
d1dlwa_ 1dlw A
46462 sp a.1.1.1 - Green alga
(Chlamydomonas eugametos) 14983 px
a.1.1.1 d1dlya_ 1dly A
63437 sp a.1.1.1 -
Mycobacterium tuberculosis 62301
px a.1.1.1 d1idra_ 1idr A
------------------------------
---------------------------------------
12Relational Database
- Central Idea Data as relations in a table
- E.g. Employee
------------------------------- id
name salary role --------------------
----------- 46457 pete 50.000
director 46458 jane 60.000 nurse
46459 asif 70.000 driver
-------------------------------
13Relational Database
- Central Idea Data as relations in a table
- E.g. pets
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
14Relational Database
- Central Idea Data as relations in a table
- E.g. school
---------------------- id prof
subject ---------------------- 51221
bert bio 55435 anne math
----------------------
---------------------- id name
subject ---------------------- 46458
rick bio 46459 gerd bio
46460 mary bio 46461 ella math
14982 anne math 46462 paul
math ----------------------
-------------------------- subject room
day time --------------------------
bio A mo 3pm math B
tue 1pm --------------------------
15Relational Database
- A cell in the table stores a single number or
string, but not a list - Lists, sets need to be flattened
------------------- prof subjects
------------------- bert bio,sport
anne math,arts -------------------
16Bioinformatics 10 years of resistance to
flattening!
- Why the resistance?
- Bioinformatics data is naturally nested
- Extensive Use of sets and lists
- E.g. Swissprot Features, keywords, References
-
- Such data can be flattened, but the resulting
relational schema is hard to understand hence it
is hard to formulate queries. - For example, storing the SWISSPROT entry in a
relational database would split it over 15-20
tables.
17Relational Databases
- RDB introduced in 1970 by Codd
- Took off in the 80s
- In the business world, relational databases are
the rule (Oracle, Sybase, mySQL, DB2, Microsoft
Access). - Large biomedical databases typically use a
relational technology but there are also a lot
of homegrown systems (ACeDB, SRS indexed files).
Data is almost always viewed and exported in a
variety of flat file formats (EMBL, GenBank among
others)
18The flood of biomedical data
- Since 1980, the number and size of biomedical
databases has been growing exponentially. - How can you find sources of information you are
seeking? - Nucleic Acids Research Database Issue in January
of every year (http//nar.oupjournals.org/) - Dbcat (http//www.infobiogen.fr/ ) a flat file
database of 500 biological databases.
Susan B. Davidson, Biol537/CIS636, Fall 2003
19Relational Schema
- The schema of a database is a set of relation
names, their field names and types. - Example
- Entry(ID int, Length int, Seq string, Mod
date) - Feature(ID int, Type string, From int,To
int) - Entry and Feature are relation names,
- ID, Seq, Mod, etc are attribute names, and
- int, string, date are domains
-
Susan B. Davidson, Biol537/CIS636, Fall 2003
20Relation Instance
- An instance of a relation is a set of tuples of
the type of the relation. - A tuple of Entry could be
- ( ID 82814, Length 597, Seqccagctaaccg, Mod
1-7-95) - A tuple of Feature could be
- (ID 82814, Typesource, From1,To8959)
Susan B. Davidson, Biol537/CIS636, Fall 2003
21Tabular representation
- Typically, relations are displayed as tables
- Sequence
- Feature
- ID Length Seq Mod
- 82814 597 ccagctaa...
1-07-95 - 98608 18976 accgcct...
2-14-98
?tuples
- ID Type From To
- 82814 Source 1 184
- 82814 Gene 23 65
Susan B. Davidson, Biol537/CIS636, Fall 2003
22Entities and Relationships
- There is a one-many relationship from Entry to
Feature each entry can have many features, but a
feature can be on at most one entry. - Put another way, the existence of a feature
depends on the existence of the owning entry ?
referential integrity
Susan B. Davidson, Biol537/CIS636, Fall 2003
23Integrity Constraints
- ID is the key of Entry, indicated by underlining
- No two tuples of any instance of Entry can have
the same ID. - In Feature, there is a referential integrity
constraint on ID - Every ID in Feature must appear in some tuple in
Entry. - This is specified in the data definition language
(DDL), and enforced by the system as updates are
made to the instance.
Susan B. Davidson, Biol537/CIS636, Fall 2003
24DDL for this relational schema
CREATE TABLE Feature (Id INTEGER, Type
CHAR(15), From INTEGER, To INTEGER,
PRIMARY KEY (Id, Type,
From, To) FOREIGN KEY (Id)
REFERENCES Entry ON DELETE CASCADE
ON UPDATE CASCADE)
CREATE TABLE Entry (Id INTEGER, Length
INTEGER, Sequence LONGCHAR, Mod DATE,
PRIMARY KEY (Id) )
Susan B. Davidson, Biol537/CIS636, Fall 2003
25Querying relational databases
- The language SQL has become a standard for
querying relational databases. Based on a
curious mixture of the relational algebra and
relational calculus (formal languages), it allows
new relations of information to be computed from
a set of relations. - Unlike the relational algebra, it allows other
useful stuff count, sum, min, max, etc.
Susan B. Davidson, Biol537/CIS636, Fall 2003
26Basic Query
SELECT DISTINCT target-list FROM
relation-list WHERE qualification
- relation-list A list of relation names (possibly
with a range-variable after each name). - target-list A list of attributes of relations in
relation-list. can be used to denote all atts. - qualification Comparisons (Attr op const or
Attr1 op Attr2, where op is one of lt, lt, gt, gt,
, ltgt combined
using AND, OR and NOT. - DISTINCT (optional) keyword indicates that the
answer should not contain duplicates. Default is
that duplicates are not eliminated!
Susan B. Davidson, Biol537/CIS636, Fall 2003
27Conceptual Evaluation Strategy
- Compute the product of relation-list
- Discard tuples that fail qualification
- Project over attributes in target-list
- If DISTINCT then eliminate duplicates
-
- This is probably a very bad way of executing
the query, and a good query optimizer will use
all sorts of tricks to find efficient strategies
to compute the same answer.
Susan B. Davidson, Biol537/CIS636, Fall 2003
28Sample tables
- ID Length Seq Mod
- 82814 597 ccagctaa...
1-07-95 - 98608 18976 accgcct...
2-14-98 - 16665 gtgtaa.
1-19-97 - 76582 9976 actgga
2-29-00
- ID Type From To
- 82814 Source 1 597
- 82814 Gene 23 65
- Gene 3 9999
- 13428 Gene 11000 16665
- 13428 Source 1 16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
29Simple queries
Print all sequences with length less than 10000.
SELECT FROM Sequence WHERE Length lt 10000
- ID Length Seq Mod
- 82814 597 ccagctaa...
1-07-95 - 76582 9976 actgga
2-29-00
- Type
- Source
- Gene
- Gene
- Gene
- Source
SELECT Type FROM Feature
Print the type of all features.
Susan B. Davidson, Biol537/CIS636, Fall 2003
30Distinct
- Note that SQL did not eliminate duplicates. We
need to request this explicitly.
SELECT DISTINCT Type FROM Feature
Print the type of all features (no duplicates).
Susan B. Davidson, Biol537/CIS636, Fall 2003
31Pattern Matching
- Can be used in where clause. _ denotes any
character, 0 or more characters.
SELECT FROM Sequence WHERE Seq LIKE a_g'
- ID Length Seq Mod
- 98608 18976 accgcct...
2-14-98 - 76582 9976 actgga
2-29-00
Susan B. Davidson, Biol537/CIS636, Fall 2003
32Arithmetic
- as can be used to label columns in the output
arithmetic can be used to compute results
SELECT DISTINCT ID, To-From1 as Length FROM
Feature
- ID Length
- 82814 597
- 82814 43
- 13428 9997
- 13428 5666
- 13428 16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
33Set operations -- union
SELECT ID FROM Sequence WHERE Lengthlt10000 UNION
SELECT ID FROM Feature WHERE TypeSource
- Duplicates do not occur in the union.
Susan B. Davidson, Biol537/CIS636, Fall 2003
34The UNION ALL operator preserves duplicates
SELECT ID FROM Sequence WHERE Lengthlt10000 UNION
ALL SELECT ID FROM Feature WHERE TypeSource
Susan B. Davidson, Biol537/CIS636, Fall 2003
35Intersection and difference
SELECT Id FROM Sequence INTERSECT SELECT Id FROM
Feature
SELECT Id FROM Sequence MINUS SELECT Id FROM
Feature
Susan B. Davidson, Biol537/CIS636, Fall 2003
36Products
SELECT FROM Sequence,Feature
- ID Length Seq
Mod ID Type From To - 82814 597 ccagctaa...
1-07-95 82814 Source 1 597 - 98608 18976 accgcct...
2-14-98 82814 Source 1 597 - 16665 gtgtaa.
1-19-97 82814 Source 1 597 - 76582 9976 actgga
2-29-00 82814 Source 1 597 - . (lots more!)
- Note that the ID column name is duplicated in the
output.
Susan B. Davidson, Biol537/CIS636, Fall 2003
37Conditional join
SELECT FROM Sequence, Feature WHERE Sequence.Id
Feature.Id
- ID Length Seq
Mod ID Type From To - 82814 597 ccagctaa...
1-07-95 82814 Source 1 597 - 82814 597 ccagctaa...
1-07-95 82814 Gene 23 65 - 13428 16665 gtgtaa.
1-19-97 13428 Gene 3 9999 - 16665 gtgtaa. 1-19-97
13428 Gene 11000 16665 - 13428 16665 gtgtaa.
1-19-97 13428 Source 1 16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
38Counting
SELECT COUNT() FROM Feature
Print the number of feature entries.
Print the number of types of features.
- Surprisingly, the answer to both of these is the
following
SELECT COUNT(Type) FROM Feature
Susan B. Davidson, Biol537/CIS636, Fall 2003
39Counting, cont.
- To fix this, we use the keyword DISTINCT
- Can also use SUM, AVG, MIN and MAX.
SELECT COUNT(DISTINCT Type) FROM Feature
Susan B. Davidson, Biol537/CIS636, Fall 2003
40Group by
- So far, these aggregate operators have been
applied to all qualifying tuples. Sometimes we
want to apply them to each of several groups of
tuples. - For example Print the type and number of
features of each type.
Susan B. Davidson, Biol537/CIS636, Fall 2003
41Group by
SELECT Type, COUNT() FROM Feature GROUP BY Type
- Note that only the columns that appear in the
GROUP BY statement and aggregated columns can
appear in the output. So the following would
generate an error.
SELECT Type, From, To, COUNT() FROM
Feature GROUP BY Type
Susan B. Davidson, Biol537/CIS636, Fall 2003
42Group by having
- HAVING is to GROUP BY as WHERE is to FROM
- HAVING is used to restrict the groups that
appear in the result.
SELECT Type, COUNT() FROM Feature WHERE From-To
gt 50 GROUP BY Type HAVING AVG(From-To)gt 8500
Susan B. Davidson, Biol537/CIS636, Fall 2003
43Summary
- SQL is relationally complete allows you to
perform operators in an algebra of relations
(the relational algebra). - Additional features string comparisons, set
membership, arithmetic and grouping. - In contrast, Entrez is a much more limited
language.
Susan B. Davidson, Biol537/CIS636, Fall 2003
44A Little Exercise
45A Little Exercise
- Given the table pet below let us formulate some
queries
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
46A Little Exercise
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
47A Little Exercise
- SELECT name
- FROM pet
- Get all owners and list them only once
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
48A Little Exercise
- SELECT DISTINCT owner
- FROM pet
- Select the names of all birds
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
49A Little Exercise
- SELECT name
- FROM pet
- WHERE speciesbird
- Select the names of all female birds
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
50A Little Exercise
- SELECT name
- FROM pet
- WHERE speciesbird AND sexf
- Select names and owners
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
51A Little Exercise
- SELECT name,owner
- FROM pet
- Select owners of birds and dogs
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
52A Little Exercise
- SELECT owner
- FROM pet
- WHERE speciesbird OR speciesdog
- Select all owners starting with Dia
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
53A Little Exercise
- SELECT owner
- FROM pet
- WHERE owner LIKE Dia
- How many pets has Gwen?...
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
54A Little Exercise
- SELECT count(name)
- FROM pet
- WHERE ownerGwen
- Select owners of male pets in sorted order
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
55A Little Exercise
- SELECT owner
- FROM pet
- WHERE sexm
- ORDER BY owner
- List owners and the number of pets they have
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
56A Little Exercise
- List owners and the number of pets they have in
descending order
- SELECT owner, COUNT(name)
- FROM pet
- GROUP BY owner
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
57A Little Exercise
- List owners and the number of pets they have in
descending order, but only if they have more than
1 pet
- SELECT owner, COUNT(name) AS num
- FROM pet
- GROUP BY owner
- ORDER BY num DESC
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
58A Little Exercise
- SELECT owner, COUNT(name) AS num
- FROM pet
- GROUP BY owner HAVING num gt 1
- ORDER BY num DESC
- List all pairs of cats and dogs
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
59A Little Exercise
- SELECT p1.name, p1.species, p2.name, p2.species
- FROM pet AS p1, pet AS p2
- WHERE p1.speciesdog AND p2.speciescat
- Select all male/female pairs of the same species
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
60A Little Exercise
- SELECT p1.name, p1.species, p1.sex, p2.name,
p2.species, p2.sex - FROM pet AS p1, pet AS p2
- WHERE p1.speciesp2.species AND p1.sexm AND
p2.sexf - Can we write p1.sex ! p2.sex instead?
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
61A Little Exercise
- We would get the pair Whistler and Chirpy as well
---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
62A Little Science
- When working with SCOP through the web interface
we are limited in what we can ask - What can we get out of SCOP when it is available
as a relational table? - A reminder
- Classes all alpha, all beta, alpha/beta,
alphabeta - SCOP family gt30 sequence similarity
- SCOP superfamily good structural similiary
(possibly lt30)
63A Little Science
- At low sequence identity, good structural
alignments possible
Family
Same Superfamily, But not family
30
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.p
pt
64A Little Science
- Three tables
- cla, PDB entry and reference to its class, fold,
superfamily, family, domain - des, description of each node in the SCOP
hierarchy - subchain, chain and possibly beginning and end on
chain for a domain instance - astral, sequence for a domain
65A Little Science
- mysqlgt SELECT FROM cla LIMIT 1
- --------------------------------------------
------------------------------- - sid pdb_id sccs cl cf sf
fa dm sp px - --------------------------------------------
------------------------------- - d1dlwa_ 1dlw a.1.1.1 46456 46457
46458 46459 46460 46461 14982 - --------------------------------------------
------------------------------- - mysqlgt SELECT FROM des LIMIT 1
- ---------------------------------------------
- id type sccs sid description
- ---------------------------------------------
- 46456 cl a - All alpha proteins
- ---------------------------------------------
- mysqlgt SELECT FROM astral LIMIT 1
- -----------------------------------------------
------------------------------ - sid sccs seq
- -----------------------------------------------
------------------------------ - d1dlwa_ a.1.1.1 slfeqlggqaavqavtaqfyaniqadat
vatffngidmpnqtnktaaflcaalgg...
66Entity relationship diagram for SCOP
Thanks to Boris VassilevT
67A Little Science
- How many nodes are there in the hierarchy of type
class, fold, superfamily, family?
des -----------------------------------------
---- id type sccs sid description
-------------------------------------
-------- 46456 cl a - All
alpha proteins ----------------------------
-----------------
68A Little Science
- How many nodes are there in the hierarchy of type
class, fold, superfamily, family? - Let us first find out how these types are
calledSELECT DISTINCT type - FROM des
- Now lets list them with the numbers
des -----------------------------------------
---- id type sccs sid description
-------------------------------------
-------- 46456 cl a - All
alpha proteins ----------------------------
-----------------
69A Little Science
- How many nodes are there in the hierarchy of type
class, fold, superfamily, family? - SELECT type, COUNT() AS num FROM des
GROUP BY type ORDER BY
num - There are not that many morefamilies than
superfamilies. - Which superfamily has the mostfamilies
------------- type num
------------- cl 11 cf
854 sf 1305 fa 2156 dm
4567 sp 7111 px 44327
-------------
des -----------------------------------------
---- id type sccs sid description
-------------------------------------
-------- 46456 cl a - All
alpha proteins ----------------------------
-----------------
70A Little Science
Which superfamily has the most families? SELECT
des.sccs, des.description,
COUNT(DISTINCT cla.fa) AS num FROM des,
cla WHERE des.idcla.sf GROUP BY cla.sf
ORDER BY num DESC -----------------------------
------------------------------------------
sccs description
num ------------------------
-----------------------------------------------
a.4.5 "Winged helix" DNA-binding domain
35 c.69.1
alpha/beta-Hydrolases
23 c.66.1 S-adenosyl-L-methionine-
dependent methyltransferases 20 c.52.1
Restriction endonuclease-like
19 c.37.1 P-loop containing
nucleotide triphosphate hydrolases 18
b.18.1 Galactose-binding domain-like
15 d.92.1
Metalloproteases ("zincins"), catalytic domain
14 b.29.1 Concanavalin A-like
lectins/glucanases 14
f.2.1 Membrane all-alpha
13 c.47.1
Thioredoxin-like
12 c.68.1 Nucleotide-diphospho-sug
ar transferases 12 c.2.1
NAD(P)-binding Rossmann-fold domains
11 a.4.1 Homeodomain-like
10
b.40.4 Nucleic acid-binding proteins
10 a.118.1 ARM repeat
10
71A Little Science
- Which families does the DNA binding-domain
superfamily have? - The sccs of the superfamily is a.4.5. Its
families have sccs a.4.5.1, a.4.5.2,, so how can
we list them?
mysqlgt SELECT FROM cla LIMIT 1 --------------
--------------------------------------------
----------------- sid pdb_id sccs
cl cf sf fa dm sp
px -------------------------------------
--------------------------------------
d1dlwa_ 1dlw a.1.1.1 46456 46457
46458 46459 46460 46461 14982
------------------------------------------
--------------------------------- mysqlgt
SELECT FROM des LIMIT 1 -------------------
-------------------------- id type
sccs sid description
-------------------------------------------
-- 46456 cl a - All alpha
proteins ----------------------------------
-----------
72A Little Science
- Which families does the DNA binding-domain
superfamily have? - SELECT DISTINCT sccs, description
- FROM des
- WHERE sccs LIKE a.4.5 AND typefa
- ORDER BY sccs
- sccs description
- ------------------------------------------------
---------- - a.4.5.1 Biotin repressor-like
- a.4.5.10 Replication initiation protein
- a.4.5.11 Helicase DNA-binding domain
- a.4.5.12 Restriction endonuclease FokI,
N-terminal (recognition) domain - a.4.5.13 Histone H1/H5
- a.4.5.14 Forkhead DNA-binding domain
- a.4.5.15 DNA-binding domain from rap30
- a.4.5.16 C-terminal domain of RPA32
- a.4.5.17 Cell cycle transcription factor
e2f-dp - a.4.5.18 The central core domain of TFIIE
beta - a.4.5.19 Z-DNA binding domain
73A Little Science
- Which families does the DNA binding-domain
superfamily have? - ...
- a.4.5.24 Iron-dependent represor protein
- a.4.5.25 Methionine aminopeptidase, insert
domain - a.4.5.26 mu transposase, DNA-binding domain
- a.4.5.27 TnsA endonuclease, C-terminal domain
- a.4.5.28 MarR-like transcriptional regulators
- a.4.5.29 Plant O-methyltransferase,
N-terminal domain - a.4.5.3 Arginine repressor (ArgR),
N-terminal DNA-binding domain - a.4.5.30 C-terminal domain of the rap74
subunit of TFIIF - a.4.5.31 DEP domain
- a.4.5.32 Lrp/AsnC-like transcriptional
regulator N-terminal domain - a.4.5.33 Thanscriptional regulator IclR,
N-terminal domain - a.4.5.34 SCF ubiquitin ligase complex WHB
domain - a.4.5.35 C-terminal fragment of elongation
factor SelB - a.4.5.4 CAP C-terminal domain-like
- a.4.5.5 ArsR-like transcriptional regulators
- a.4.5.6 GntR-like transcriptional regulators
74A Little Science
- Which families does the DNA binding-domain
superfamily have? - Lets find example pdbs
- mysqlgt SELECT DISTINCT pdb_id FROM cla WHERE
sccs"a.4.5.1" - --------
- pdb_id
- --------
- 1bia
- 1hxd
- 1bib
- 1j5y
- --------
- mysqlgt SELECT DISTINCT pdb_id FROM cla WHERE
sccs"a.4.5.2" - --------
- pdb_id
- --------
- 1jhf
- 1jhh
75A Little Science
1bia
1jhf
76A Little Science
1cgp a.4.5.4
77A Little Science
1smt, a.4.5.5
1b9n, a.4.5.8
1f4k, which family?
1f4k, a.4.5.7
1hw1, a.4.5.6
1bm9, a.4.5.7
78A Little Science
- How many percent of superfamilies have only 1
family, how many 2, ?
79A Little Science
- How many percent of superfamilies have only
1,2,3, families? First lets deposit the result
of the query that found the number of families
for each superfamily in a table - CREATE TABLE fa_freq AS
- SELECT des.sccs, des.description,
COUNT(DISTINCT cla.fa) AS num - FROM des, cla WHERE des.idcla.sf
- GROUP BY cla.sf
- ORDER BY num desc
- Now we count how many superfamilies have 1,2,3,
families - SELECT num AS fa_per_sf, COUNT() AS freq
- FROM fa_freq
- GROUP BY num
80A Little Science
How many percent of superfamilies have only
1,2,3, families?
----------------- fa_per_sf perc
----------------- 1 0.75
2 0.13 3 0.05 4
0.02 5 0.02 6 0.01
7 0.00 8 0.00
9 0.00 10 0.00 11
0.00 12 0.00 13 0.00
14 0.00 15 0.00
18 0.00 19 0.00 20
0.00 23 0.00 35 0.00
-----------------
- -----------------
- fa_per_sf freq
- -----------------
- 1 981
- 2 164
- 3 65
- 4 29
- 5 25
- 6 14
- 7 6
- 8 5
- 9 1
- 10 3
- 11 1
- 12 2
- 13 1
- 14 2
- 15 1
- 18 1
SELECT COUNT() FROM fa_freq ----------
count() ---------- 1305 ----------
SELECT num AS fa_per_sf, (COUNT()/1305)
AS perc FROM fa_freq GROUP BY num
81A Little Science
- How many percent of superfamilies have only
1,2,3, families? - This is interesting! For the majority of
superfamilies there is only one family! - What is the PDB structure with the largest number
of (distinct) superfamilies?
82A Little Science
- What is the PDB structure with the largest number
of (distinct) superfamilies?
- SELECT pdb_id,
- COUNT(sf) AS sf_num
- FROM cla
- GROUP BY pdb_id
- ORDER BY sf_num DESC
- LIMIT 10
- ----------------
- pdb_id sf_num
- ----------------
- 1aon 49
- 1hto 48
- 1ir2 48
- 1htq 48
- 1der 42
- 1f49 40
- 1jyy 40
- 1gho 40
- 1jyz 40
- 1jz1 40
SELECT pdb_id, COUNT(DISTINCT sf)
AS distinct_sf_num FROM cla GROUP BY pdb_id
ORDER BY distinct_sf_num DESC LIMIT
10 ------------------------- pdb_id
distinct_sf_num -------------------------
1m1k 23 1k9m
23 1kd1 23 1kqs
23 1jj2 23 1k8a
23 1ffk 22
1i96 21 1hnz
20 1hr0 20
-------------------------
83A structure with 23 different superfamilies
- 1k9m Co Crystal Structure Of Tylosin Bound To The
50S Ribosomal Subunit Of Haloarcula
MarismortuiRibosome
84A Little Science
- Now lets plot how many PDBs have 1, 2, 3,
distinct superfamilies - First of all let us put the result of the
previous slide in a table (note if the table
already exists we have to erase it first DROP
TABLE pdb_sf_num. But be careful using DROP - CREATE TABLE pdb_sf_num AS
- SELECT pdb_id,
- COUNT(DISTINCT sf) AS distinct_sf_num
- FROM cla
- GROUP BY pdb_id
- ORDER BY distinct_sf_num DESC
- Now let us count how many PDBs with 1,2,3,
distinct superfamilies
85A Little Science
- How many PDBs have 1, 2, 3, distinct
superfamilies - SELECT distinct_sf_num, COUNT(pdb_id) AS num
- FROM pdb_sf_num
- GROUP BY distinct_sf_num
- ORDER BY distinct_sf_num
- ------------------------
- distinct_sf_num num
- ------------------------
- 1 13960
- 2 2721
- 3 495
- 4 178
- 5 33
- 6 25
- 7 1
- 9 4
- 20 9
- 21 1
- 22 1
86A Little Science
- Lets do the same in percent
- SELECT COUNT(DISTINCT pdb_id)
- FROM cla
- ----------
- count()
- ----------
- 17434
- ----------
- There are 17434 PDB IDs
87A Little Science
- How many PDBs have 1, 2, 3, distinct
superfamilies - SELECT distinct_sf_num, COUNT(pdb_id)/17434 AS
perc - FROM pdb_sf_num
- GROUP BY distinct_sf_num
- ORDER BY distinct_sf_num
- -----------------------
- distinct_sf_num perc
- -----------------------
- 1 0.80
- 2 0.16
- 3 0.03
- 4 0.01
- 5 0.00
- 6 0.00
- 7 0.00
- 9 0.00
- 20 0.00
- 21 0.00
- 22 0.00
88A Little Science
- What are the most popular superfamilies?
- I.e. for which are there the most PDB entries
89A Little Science
- What are the most popular superfamilies?
- SELECT des.sccs, des.description,
- COUNT(DISTINCT cla.pdb_id) AS
num_of_pdb_ids - FROM cla,des
- WHERE des.idcla.sf
- GROUP BY cla.sf
- ORDER BY num_of_pdb_ids DESC
- LIMIT 10
- ------------------------------------------------
------------------------------ - sccs description
num_of_pdb_ids - ------------------------------------------------
------------------------------ - b.1.1 Immunoglobulin
823 - d.2.1 Lysozyme-like
777 - b.47.1 Trypsin-like serine proteases
649 - c.37.1 P-loop containing nucleotide
triphosphate hydrolases 521 - c.2.1 NAD(P)-binding Rossmann-fold domains
384 - a.1.1 Globin-like
384 - c.1.8 (Trans)glycosidases
332 - b.50.1 Acid proteases
288
90A Little Science
- Are all superfamilies equally likely to co-occur?
- Let us generate a co-occurrence map as an answer
- Which superfamilies co-occur most frequently
- Which superfamilies have the most co-occurrence
partners - The co-occurrence map should consist of two
tables - A table with PDB ID, superfamily 1, superfamily 2
(to avoid repetition we will require that sf1 is
alphabetically before sf2) - A table with superfamily 1 and 2 and the number
of PDBs containing this co-occurrence
91A Little Science
- Co-occurrence map
- SELECT DISTINCT c1.pdb_id, c1.sf, c2.sf
- FROM cla AS c1, cla AS c2
- WHERE c1.pdb_idc2.pdb_id AND c1.sfltc2.sf
- LIMIT 10
- ----------------------
- pdb_id sf sf
- ----------------------
- 1cqx 46458 63380
- 1cqx 46458 52343
- 1gvh 46458 63380
- 1gvh 46458 52343
- 1b33 46458 54580
- 1qgw 46458 56568
- 1kf6 46548 46977
- 1kf6 46548 51905
- 1kf6 46548 54292
- 1kf6 46548 56425
- ----------------------
92A Little Science
- Co-occurrence Map
- CREATE TABLE cooc AS
- SELECT DISTINCT c1.pdb_id,
- c1.sf AS sf1, d1.description AS sf1name,
- c2.sf AS sf2, d2.description AS sf2name
- FROM cla AS c1, cla AS c2, des AS d1, des AS d2
- WHERE c1.pdb_idc2.pdb_id AND c1.sfltc2.sf AND
- c1.sfd1.id AND c2.sfd2.id
- ----------------------------------------------
-------------------------------------------------
--------------- - pdb_id sf1 sf1name sf2
sf2name
- ----------------------------------------------
-------------------------------------------------
--------------- - 1cqx 46458 Globin-like
63380 Riboflavin synthase domain-like
- 1cqx 46458 Globin-like
52343 Ferredoxin reductase-like, C-terminal
NADP-linked domain - 1gvh 46458 Globin-like
63380 Riboflavin synthase domain-like
- 1gvh 46458 Globin-like
52343 Ferredoxin reductase-like, C-terminal
NADP-linked domain - 1b33 46458 Globin-like
54580 Allophycocyanin linker chain (domain)
- 1qgw 46458 Globin-like
56568 Non-globular alphabeta subunits of
globular proteins - 1kf6 46548 alpha-helical ferredoxin
46977 Succinate dehydrogenase/fumarate
reductase C-terminal domain
93A Little Science
- Number of instances in co-occurrence map
- SELECT COUNT(DISTINCT pdb_id) AS num,
- sf1, sf1name, sf2, sf2name
- FROM cooc
- GROUP BY sf1,sf2
- ORDER BY num DESC
- LIMIT 10
- -----------------------------------------------
------------------------------------------------
------------------------------------- - num sf1 sf1name
sf2 sf2name
- -----------------------------------------------
------------------------------------------------
------------------------------------- - 137 48726 Immunoglobulin
54452 MHC antigen-recognition
domain - 125 51011 alpha-Amylases, C-terminal
beta-sheet domain 51445 (Trans)glycosidases
- 117 47616 Glutathione S-transferases,
C-terminal domain 52833 Thioredoxin-like
- 99 47802 DNA polymerase beta, N-terminal
domain-like 56699 Nucleotidyltransferases
- 97 53098 Ribonuclease H-like
56672 DNA/RNApolymerases
- 74 51735 NAD(P)-binding Rossmann-fold
domains 55347 Gly