Introduction to Databases - PowerPoint PPT Presentation

1 / 130
About This Presentation
Title:

Introduction to Databases

Description:

we interacted directly and manually with databases and tools ... PubDate Year 1999 /Year Month Nov /Month /PubDate /JournalIssue /Journal ... – PowerPoint PPT presentation

Number of Views:133
Avg rating:3.0/5.0
Slides: 131
Provided by: biotecTu
Category:

less

Transcript and Presenter's Notes

Title: Introduction to Databases


1
Introduction to Databases
2
Structure
  • Motivation
  • Introduction to MySQL
  • Example Queries
  • Using SQL to query SCOP

3
Motivation
  • In the last term,
  • we accessed most information online via the web
  • we interacted directly and manually with
    databases and tools
  • we had to manually submit queries, interpret
    results. select interesting results, cutpaste
    them, and submit queries again,
  • Pro
  • Reasonably easy to get hold of information
  • Con
  • Not possible to ask many queries
  • Queries limited by interface provided by web page
  • Difficult/impossible to integrate information
    from different sites
  • In this term, we will look at the databases
    underlying the online front ends
  • How is the data internally stored?
  • How can we - and more important computer programs
    - directly interact with the underlying data, so
    that we can ask more powerful queries, large
    queries, and integrate different systems

4
What actually happens when you retrieve data
online?
5
What actually happens
  • You are limited by what web server allows you to
    ask
  • Example CATH
  • PDB ID,
  • CATH code, or
  • General text
  • But you cannot ask
  • In how many different PDB structures is there a
    P-loop domain?
  • Is there a PDB entry with a P-loop and a
    DNA-binding domain
  • How many different superfamilies does the largest
    structure in PDB have?
  • With direct access to the underlying database you
    could answer all these questions (and many more)

6
Querying over the Web
  • Problem is always the same
  • The web interface limits access to the underlying
    database
  • How can we interact directly with the database

7
What databases are about
  • Logical organization of data
  • data models, schema design, dictionaries
  • Physical organization of data
  • Fast retrieval, indexing, compact storage of data
  • Other requirements
  • Logging (important to know who did what to the
    data)
  • Security and access control (important to know
    who can do what)
  • Transactions and concurrency control (important
    when more than one person is working on database)
  • Integrity (important to ensure that only valid
    entries in the database)
  • Recovery (important as hardware and software can
    sometimes fail

8
Different types of databases
  • Flat files
  • XML
  • Relational database
  • (Object databases)
  • (Object relational databases)

9
Flat files
ID BTBPTIG standard genomic DNA MAM 3998
BP. XX AC X03365 K00966 XX SV
X03365.1 XX DT 18-NOV-1986 (Rel. 10,
Created) DT 20-MAY-1992 (Rel. 31, Last updated,
Version 3) XX DE Bovine pancreatic trypsin
inhibitor (BPTI) gene XX KW Alu-like repetitive
sequence protease inhibitor trypsin
inhibitor. XX OS Bos taurus (cow) OC
Eukaryota Metazoa Chordata Craniata
Vertebrata Euteleostomi Mammalia OC
Eutheria Cetartiodactyla Ruminantia Pecora
Bovoidea Bovidae Bovinae OC Bos. XX RN
1 RP 1-3998 RX MEDLINE 86158754. RX
PUBMED 2420326. RA Kingston I.B., Anderson
S. RT "Sequences encoding two trypsin
inhibitors occur in strikingly similar RT
genomic environments" RL Biochem. J.
233(2)443-450(1986). XX RN 2 RX MEDLINE
84070725. RX PUBMED 6580617. RA Anderson S.,
Kingston I.B. RT "Isolation of a genomic clone
for bovine pancreatic trypsin inhibitor by RT
using a unique-sequence synthetic DNA probe."
  • We can store any data in a flat file, e.g. EMBL
  • But is this a database?
  • Logical data organisation None, unless we define
    one (as done for EMBL) and adhere to it, which is
    not enforced
  • Physical data organisation None, we cannot
    optimise retrieval for common queries
  • Logging No
  • Access control Implicit through Unix
  • Transaction and concurrency control None
  • Integrity None
  • Recovery If files are backed-up they can be
    recovered. However, not on the fly

10
XML files
ltArticlegt ltJournalgt ltISSNgt0270-7306lt/ISSNgt ltJourn
alIssuegt ltVolumegt19lt/Volumegt ltIssuegt11lt/Issuegt
ltPubDategt ltYeargt1999lt/Yeargt ltMonthgtNovlt/M
onthgt lt/PubDategt lt/JournalIssuegt lt/Journalgt ltAr
ticleTitlegtDifferential regulation of the cell
wall integrity mitogen-activated protein kinase
pathway in budding yeast by the protein tyrosine
phosphatases Ptp2 and Ptp3. lt/ArticleTitlegt ltPagin
ationgt ltMedlinePgngt7651-60lt/MedlinePgngt lt/Paginat
iongt ltAbstractgt ltAbstractTextgtMitogen-activated
protein kinases (MAPKs) are inactivated by
dual-specificity and protein tyrosine
phosphatases (PTPs) in yeasts. In Saccharomyces
cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate
the MAPKs, Hog1 and Fus3, with different
specificities... lt/AbstractTextgt lt/Abstractgt ltAffi
liationgtDepartment of Chemistry, University of
Colorado, Boulder, Colorado 80309-0215,
USA. lt/Affiliationgt
  • We can store any data in XML, the eXtentable
    Mark-up Language, e.g. Medline
  • But is this a database?
  • Logical data organisation yes, XML schema, which
    is enforced
  • Physical data organisation None, we cannot
    optimise retrieval for common queries
  • Logging No
  • Access control Implicit through Unix
  • Transaction and concurrency control None
  • Integrity None
  • Recovery If files are backed-up they can be
    recovered. However, not on the fly

11
Relational Database
  • Central Idea Data as relations in a table
  • E.g. SCOP, Structural Classification of Proteins

---------------------------------------------
------------------------ id type sccs
sid description
-------------------------------------------
-------------------------- 46457 cf a.1
- Globin-like
46458 sf a.1.1 -
Globin-like 46459
fa a.1.1.1 - Truncated hemoglobin
46460 dm a.1.1.1 -
Truncated hemoglobin 46461
sp a.1.1.1 - Ciliate (Paramecium
caudatum) 14982 px a.1.1.1
d1dlwa_ 1dlw A
46462 sp a.1.1.1 - Green alga
(Chlamydomonas eugametos) 14983 px
a.1.1.1 d1dlya_ 1dly A
63437 sp a.1.1.1 -
Mycobacterium tuberculosis 62301
px a.1.1.1 d1idra_ 1idr A
------------------------------
---------------------------------------
12
Relational Database
  • Central Idea Data as relations in a table
  • E.g. Employee

------------------------------- id
name salary role --------------------
----------- 46457 pete 50.000
director 46458 jane 60.000 nurse
46459 asif 70.000 driver
-------------------------------
13
Relational Database
  • Central Idea Data as relations in a table
  • E.g. pets

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
14
Relational Database
  • Central Idea Data as relations in a table
  • E.g. school

---------------------- id prof
subject ---------------------- 51221
bert bio 55435 anne math
----------------------
---------------------- id name
subject ---------------------- 46458
rick bio 46459 gerd bio
46460 mary bio 46461 ella math
14982 anne math 46462 paul
math ----------------------
-------------------------- subject room
day time --------------------------
bio A mo 3pm math B
tue 1pm --------------------------
15
Relational Database
  • A cell in the table stores a single number or
    string, but not a list
  • Lists, sets need to be flattened

------------------- prof subjects
------------------- bert bio,sport
anne math,arts -------------------
16
Bioinformatics 10 years of resistance to
flattening!
  • Why the resistance?
  • Bioinformatics data is naturally nested
  • Extensive Use of sets and lists
  • E.g. Swissprot Features, keywords, References
  • Such data can be flattened, but the resulting
    relational schema is hard to understand hence it
    is hard to formulate queries.
  • For example, storing the SWISSPROT entry in a
    relational database would split it over 15-20
    tables.

17
Relational Databases
  • RDB introduced in 1970 by Codd
  • Took off in the 80s
  • In the business world, relational databases are
    the rule (Oracle, Sybase, mySQL, DB2, Microsoft
    Access).
  • Large biomedical databases typically use a
    relational technology but there are also a lot
    of homegrown systems (ACeDB, SRS indexed files).
    Data is almost always viewed and exported in a
    variety of flat file formats (EMBL, GenBank among
    others)

18
The flood of biomedical data
  • Since 1980, the number and size of biomedical
    databases has been growing exponentially.
  • How can you find sources of information you are
    seeking?
  • Nucleic Acids Research Database Issue in January
    of every year (http//nar.oupjournals.org/)
  • Dbcat (http//www.infobiogen.fr/ ) a flat file
    database of 500 biological databases.

Susan B. Davidson, Biol537/CIS636, Fall 2003
19
Relational Schema
  • The schema of a database is a set of relation
    names, their field names and types.
  • Example
  • Entry(ID int, Length int, Seq string, Mod
    date)
  • Feature(ID int, Type string, From int,To
    int)
  • Entry and Feature are relation names,
  • ID, Seq, Mod, etc are attribute names, and
  • int, string, date are domains

Susan B. Davidson, Biol537/CIS636, Fall 2003
20
Relation Instance
  • An instance of a relation is a set of tuples of
    the type of the relation.
  • A tuple of Entry could be
  • ( ID 82814, Length 597, Seqccagctaaccg, Mod
    1-7-95)
  • A tuple of Feature could be
  • (ID 82814, Typesource, From1,To8959)

Susan B. Davidson, Biol537/CIS636, Fall 2003
21
Tabular representation
  • Typically, relations are displayed as tables
  • Sequence
  • Feature
  • ID Length Seq Mod
  • 82814 597 ccagctaa...
    1-07-95
  • 98608 18976 accgcct...
    2-14-98

?tuples
  • ID Type From To
  • 82814 Source 1 184
  • 82814 Gene 23 65

Susan B. Davidson, Biol537/CIS636, Fall 2003
22
Entities and Relationships
  • There is a one-many relationship from Entry to
    Feature each entry can have many features, but a
    feature can be on at most one entry.
  • Put another way, the existence of a feature
    depends on the existence of the owning entry ?
    referential integrity

Susan B. Davidson, Biol537/CIS636, Fall 2003
23
Integrity Constraints
  • ID is the key of Entry, indicated by underlining
  • No two tuples of any instance of Entry can have
    the same ID.
  • In Feature, there is a referential integrity
    constraint on ID
  • Every ID in Feature must appear in some tuple in
    Entry.
  • This is specified in the data definition language
    (DDL), and enforced by the system as updates are
    made to the instance.

Susan B. Davidson, Biol537/CIS636, Fall 2003
24
DDL for this relational schema
CREATE TABLE Feature (Id INTEGER, Type
CHAR(15), From INTEGER, To INTEGER,
PRIMARY KEY (Id, Type,
From, To) FOREIGN KEY (Id)
REFERENCES Entry ON DELETE CASCADE
ON UPDATE CASCADE)
CREATE TABLE Entry (Id INTEGER, Length
INTEGER, Sequence LONGCHAR, Mod DATE,
PRIMARY KEY (Id) )
Susan B. Davidson, Biol537/CIS636, Fall 2003
25
Querying relational databases
  • The language SQL has become a standard for
    querying relational databases. Based on a
    curious mixture of the relational algebra and
    relational calculus (formal languages), it allows
    new relations of information to be computed from
    a set of relations.
  • Unlike the relational algebra, it allows other
    useful stuff count, sum, min, max, etc.

Susan B. Davidson, Biol537/CIS636, Fall 2003
26
Basic Query
SELECT DISTINCT target-list FROM
relation-list WHERE qualification
  • relation-list A list of relation names (possibly
    with a range-variable after each name).
  • target-list A list of attributes of relations in
    relation-list. can be used to denote all atts.
  • qualification Comparisons (Attr op const or
    Attr1 op Attr2, where op is one of lt, lt, gt, gt,
    , ltgt combined
    using AND, OR and NOT.
  • DISTINCT (optional) keyword indicates that the
    answer should not contain duplicates. Default is
    that duplicates are not eliminated!

Susan B. Davidson, Biol537/CIS636, Fall 2003
27
Conceptual Evaluation Strategy
  • Compute the product of relation-list
  • Discard tuples that fail qualification
  • Project over attributes in target-list
  • If DISTINCT then eliminate duplicates
  • This is probably a very bad way of executing
    the query, and a good query optimizer will use
    all sorts of tricks to find efficient strategies
    to compute the same answer.

Susan B. Davidson, Biol537/CIS636, Fall 2003
28
Sample tables
  • Sequence
  • Feature
  • ID Length Seq Mod
  • 82814 597 ccagctaa...
    1-07-95
  • 98608 18976 accgcct...
    2-14-98
  • 16665 gtgtaa.
    1-19-97
  • 76582 9976 actgga
    2-29-00
  • ID Type From To
  • 82814 Source 1 597
  • 82814 Gene 23 65
  • Gene 3 9999
  • 13428 Gene 11000 16665
  • 13428 Source 1 16665

Susan B. Davidson, Biol537/CIS636, Fall 2003
29
Simple queries
Print all sequences with length less than 10000.
SELECT FROM Sequence WHERE Length lt 10000
  • ID Length Seq Mod
  • 82814 597 ccagctaa...
    1-07-95
  • 76582 9976 actgga
    2-29-00
  • Type
  • Source
  • Gene
  • Gene
  • Gene
  • Source

SELECT Type FROM Feature
Print the type of all features.
Susan B. Davidson, Biol537/CIS636, Fall 2003
30
Distinct
  • Note that SQL did not eliminate duplicates. We
    need to request this explicitly.

SELECT DISTINCT Type FROM Feature
Print the type of all features (no duplicates).
  • Type
  • Source
  • Gene

Susan B. Davidson, Biol537/CIS636, Fall 2003
31
Pattern Matching
  • Can be used in where clause. _ denotes any
    character, 0 or more characters.

SELECT FROM Sequence WHERE Seq LIKE a_g'
  • ID Length Seq Mod
  • 98608 18976 accgcct...
    2-14-98
  • 76582 9976 actgga
    2-29-00

Susan B. Davidson, Biol537/CIS636, Fall 2003
32
Arithmetic
  • as can be used to label columns in the output
    arithmetic can be used to compute results

SELECT DISTINCT ID, To-From1 as Length FROM
Feature
  • ID Length
  • 82814 597
  • 82814 43
  • 13428 9997
  • 13428 5666
  • 13428 16665

Susan B. Davidson, Biol537/CIS636, Fall 2003
33
Set operations -- union
SELECT ID FROM Sequence WHERE Lengthlt10000 UNION
SELECT ID FROM Feature WHERE TypeSource
  • ID
  • 76582
  • 82814
  • 13428
  • Duplicates do not occur in the union.

Susan B. Davidson, Biol537/CIS636, Fall 2003
34
The UNION ALL operator preserves duplicates
SELECT ID FROM Sequence WHERE Lengthlt10000 UNION
ALL SELECT ID FROM Feature WHERE TypeSource
Susan B. Davidson, Biol537/CIS636, Fall 2003
35
Intersection and difference
SELECT Id FROM Sequence INTERSECT SELECT Id FROM
Feature
SELECT Id FROM Sequence MINUS SELECT Id FROM
Feature
Susan B. Davidson, Biol537/CIS636, Fall 2003
36
Products
SELECT FROM Sequence,Feature
  • ID Length Seq
    Mod ID Type From To
  • 82814 597 ccagctaa...
    1-07-95 82814 Source 1 597
  • 98608 18976 accgcct...
    2-14-98 82814 Source 1 597
  • 16665 gtgtaa.
    1-19-97 82814 Source 1 597
  • 76582 9976 actgga
    2-29-00 82814 Source 1 597
  • . (lots more!)
  • Note that the ID column name is duplicated in the
    output.

Susan B. Davidson, Biol537/CIS636, Fall 2003
37
Conditional join
SELECT FROM Sequence, Feature WHERE Sequence.Id
Feature.Id
  • ID Length Seq
    Mod ID Type From To
  • 82814 597 ccagctaa...
    1-07-95 82814 Source 1 597
  • 82814 597 ccagctaa...
    1-07-95 82814 Gene 23 65
  • 13428 16665 gtgtaa.
    1-19-97 13428 Gene 3 9999
  • 16665 gtgtaa. 1-19-97
    13428 Gene 11000 16665
  • 13428 16665 gtgtaa.
    1-19-97 13428 Source 1 16665

Susan B. Davidson, Biol537/CIS636, Fall 2003
38
Counting
SELECT COUNT() FROM Feature
Print the number of feature entries.
Print the number of types of features.
  • Surprisingly, the answer to both of these is the
    following

SELECT COUNT(Type) FROM Feature
Susan B. Davidson, Biol537/CIS636, Fall 2003
39
Counting, cont.
  • To fix this, we use the keyword DISTINCT
  • Can also use SUM, AVG, MIN and MAX.

SELECT COUNT(DISTINCT Type) FROM Feature
Susan B. Davidson, Biol537/CIS636, Fall 2003
40
Group by
  • So far, these aggregate operators have been
    applied to all qualifying tuples. Sometimes we
    want to apply them to each of several groups of
    tuples.
  • For example Print the type and number of
    features of each type.

Susan B. Davidson, Biol537/CIS636, Fall 2003
41
Group by
SELECT Type, COUNT() FROM Feature GROUP BY Type
  • Note that only the columns that appear in the
    GROUP BY statement and aggregated columns can
    appear in the output. So the following would
    generate an error.

SELECT Type, From, To, COUNT() FROM
Feature GROUP BY Type
Susan B. Davidson, Biol537/CIS636, Fall 2003
42
Group by having
  • HAVING is to GROUP BY as WHERE is to FROM
  • HAVING is used to restrict the groups that
    appear in the result.

SELECT Type, COUNT() FROM Feature WHERE From-To
gt 50 GROUP BY Type HAVING AVG(From-To)gt 8500
Susan B. Davidson, Biol537/CIS636, Fall 2003
43
Summary
  • SQL is relationally complete allows you to
    perform operators in an algebra of relations
    (the relational algebra).
  • Additional features string comparisons, set
    membership, arithmetic and grouping.
  • In contrast, Entrez is a much more limited
    language.

Susan B. Davidson, Biol537/CIS636, Fall 2003
44
A Little Exercise
45
A Little Exercise
  • Given the table pet below let us formulate some
    queries

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
46
A Little Exercise
  • Get all pet names

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
47
A Little Exercise
  • SELECT name
  • FROM pet
  • Get all owners and list them only once

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
48
A Little Exercise
  • SELECT DISTINCT owner
  • FROM pet
  • Select the names of all birds

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
49
A Little Exercise
  • SELECT name
  • FROM pet
  • WHERE speciesbird
  • Select the names of all female birds

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
50
A Little Exercise
  • SELECT name
  • FROM pet
  • WHERE speciesbird AND sexf
  • Select names and owners

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
51
A Little Exercise
  • SELECT name,owner
  • FROM pet
  • Select owners of birds and dogs

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
52
A Little Exercise
  • SELECT owner
  • FROM pet
  • WHERE speciesbird OR speciesdog
  • Select all owners starting with Dia

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
53
A Little Exercise
  • SELECT owner
  • FROM pet
  • WHERE owner LIKE Dia
  • How many pets has Gwen?...

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
54
A Little Exercise
  • SELECT count(name)
  • FROM pet
  • WHERE ownerGwen
  • Select owners of male pets in sorted order

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
55
A Little Exercise
  • SELECT owner
  • FROM pet
  • WHERE sexm
  • ORDER BY owner
  • List owners and the number of pets they have

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
56
A Little Exercise
  • List owners and the number of pets they have in
    descending order
  • SELECT owner, COUNT(name)
  • FROM pet
  • GROUP BY owner

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
57
A Little Exercise
  • List owners and the number of pets they have in
    descending order, but only if they have more than
    1 pet
  • SELECT owner, COUNT(name) AS num
  • FROM pet
  • GROUP BY owner
  • ORDER BY num DESC

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
58
A Little Exercise
  • SELECT owner, COUNT(name) AS num
  • FROM pet
  • GROUP BY owner HAVING num gt 1
  • ORDER BY num DESC
  • List all pairs of cats and dogs

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
59
A Little Exercise
  • SELECT p1.name, p1.species, p2.name, p2.species
  • FROM pet AS p1, pet AS p2
  • WHERE p1.speciesdog AND p2.speciescat
  • Select all male/female pairs of the same species

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
60
A Little Exercise
  • SELECT p1.name, p1.species, p1.sex, p2.name,
    p2.species, p2.sex
  • FROM pet AS p1, pet AS p2
  • WHERE p1.speciesp2.species AND p1.sexm AND
    p2.sexf
  • Can we write p1.sex ! p2.sex instead?

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
61
A Little Exercise
  • We would get the pair Whistler and Chirpy as well

---------------------------------------------
------------ name owner species
sex birth death ---------------
------------------------------------------
Whistler Gwen bird 0000-00-00
NULL Chirpy Gwen bird f
1998-09-11 0000-00-00 Bowser Diane
dog m 1979-08-31 1995-07-29 Fang
Benny dog m 1990-08-27
0000-00-00 Buffy Harold dog f
1989-05-13 0000-00-00 Claws Gwen
cat m 1994-03-17 0000-00-00
Fluffy Harold cat f 1993-02-04
0000-00-00 Slim Benny snake m
1996-04-29 0000-00-00 -------------------
--------------------------------------
62
A Little Science
  • When working with SCOP through the web interface
    we are limited in what we can ask
  • What can we get out of SCOP when it is available
    as a relational table?
  • A reminder
  • Classes all alpha, all beta, alpha/beta,
    alphabeta
  • SCOP family gt30 sequence similarity
  • SCOP superfamily good structural similiary
    (possibly lt30)

63
A Little Science
  • At low sequence identity, good structural
    alignments possible

Family
Same Superfamily, But not family
30
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.p
pt
64
A Little Science
  • Three tables
  • cla, PDB entry and reference to its class, fold,
    superfamily, family, domain
  • des, description of each node in the SCOP
    hierarchy
  • subchain, chain and possibly beginning and end on
    chain for a domain instance
  • astral, sequence for a domain

65
A Little Science
  • mysqlgt SELECT FROM cla LIMIT 1
  • --------------------------------------------
    -------------------------------
  • sid pdb_id sccs cl cf sf
    fa dm sp px
  • --------------------------------------------
    -------------------------------
  • d1dlwa_ 1dlw a.1.1.1 46456 46457
    46458 46459 46460 46461 14982
  • --------------------------------------------
    -------------------------------
  • mysqlgt SELECT FROM des LIMIT 1
  • ---------------------------------------------
  • id type sccs sid description
  • ---------------------------------------------
  • 46456 cl a - All alpha proteins
  • ---------------------------------------------
  • mysqlgt SELECT FROM astral LIMIT 1
  • -----------------------------------------------
    ------------------------------
  • sid sccs seq
  • -----------------------------------------------
    ------------------------------
  • d1dlwa_ a.1.1.1 slfeqlggqaavqavtaqfyaniqadat
    vatffngidmpnqtnktaaflcaalgg...

66
Entity relationship diagram for SCOP
Thanks to Boris VassilevT
67
A Little Science
  • How many nodes are there in the hierarchy of type
    class, fold, superfamily, family?

des -----------------------------------------
---- id type sccs sid description
-------------------------------------
-------- 46456 cl a - All
alpha proteins ----------------------------
-----------------
68
A Little Science
  • How many nodes are there in the hierarchy of type
    class, fold, superfamily, family?
  • Let us first find out how these types are
    calledSELECT DISTINCT type
  • FROM des
  • Now lets list them with the numbers

des -----------------------------------------
---- id type sccs sid description
-------------------------------------
-------- 46456 cl a - All
alpha proteins ----------------------------
-----------------
69
A Little Science
  • How many nodes are there in the hierarchy of type
    class, fold, superfamily, family?
  • SELECT type, COUNT() AS num FROM des
    GROUP BY type ORDER BY
    num
  • There are not that many morefamilies than
    superfamilies.
  • Which superfamily has the mostfamilies

------------- type num
------------- cl 11 cf
854 sf 1305 fa 2156 dm
4567 sp 7111 px 44327
-------------
des -----------------------------------------
---- id type sccs sid description
-------------------------------------
-------- 46456 cl a - All
alpha proteins ----------------------------
-----------------
70
A Little Science
Which superfamily has the most families? SELECT
des.sccs, des.description,
COUNT(DISTINCT cla.fa) AS num FROM des,
cla WHERE des.idcla.sf GROUP BY cla.sf
ORDER BY num DESC -----------------------------
------------------------------------------
sccs description
num ------------------------
-----------------------------------------------
a.4.5 "Winged helix" DNA-binding domain
35 c.69.1
alpha/beta-Hydrolases
23 c.66.1 S-adenosyl-L-methionine-
dependent methyltransferases 20 c.52.1
Restriction endonuclease-like
19 c.37.1 P-loop containing
nucleotide triphosphate hydrolases 18
b.18.1 Galactose-binding domain-like
15 d.92.1
Metalloproteases ("zincins"), catalytic domain
14 b.29.1 Concanavalin A-like
lectins/glucanases 14
f.2.1 Membrane all-alpha
13 c.47.1
Thioredoxin-like
12 c.68.1 Nucleotide-diphospho-sug
ar transferases 12 c.2.1
NAD(P)-binding Rossmann-fold domains
11 a.4.1 Homeodomain-like
10
b.40.4 Nucleic acid-binding proteins
10 a.118.1 ARM repeat
10
71
A Little Science
  • Which families does the DNA binding-domain
    superfamily have?
  • The sccs of the superfamily is a.4.5. Its
    families have sccs a.4.5.1, a.4.5.2,, so how can
    we list them?

mysqlgt SELECT FROM cla LIMIT 1 --------------
--------------------------------------------
----------------- sid pdb_id sccs
cl cf sf fa dm sp
px -------------------------------------
--------------------------------------
d1dlwa_ 1dlw a.1.1.1 46456 46457
46458 46459 46460 46461 14982
------------------------------------------
--------------------------------- mysqlgt
SELECT FROM des LIMIT 1 -------------------
-------------------------- id type
sccs sid description
-------------------------------------------
-- 46456 cl a - All alpha
proteins ----------------------------------
-----------
72
A Little Science
  • Which families does the DNA binding-domain
    superfamily have?
  • SELECT DISTINCT sccs, description
  • FROM des
  • WHERE sccs LIKE a.4.5 AND typefa
  • ORDER BY sccs
  • sccs description
  • ------------------------------------------------
    ----------
  • a.4.5.1 Biotin repressor-like
  • a.4.5.10 Replication initiation protein
  • a.4.5.11 Helicase DNA-binding domain
  • a.4.5.12 Restriction endonuclease FokI,
    N-terminal (recognition) domain
  • a.4.5.13 Histone H1/H5
  • a.4.5.14 Forkhead DNA-binding domain
  • a.4.5.15 DNA-binding domain from rap30
  • a.4.5.16 C-terminal domain of RPA32
  • a.4.5.17 Cell cycle transcription factor
    e2f-dp
  • a.4.5.18 The central core domain of TFIIE
    beta
  • a.4.5.19 Z-DNA binding domain

73
A Little Science
  • Which families does the DNA binding-domain
    superfamily have?
  • ...
  • a.4.5.24 Iron-dependent represor protein
  • a.4.5.25 Methionine aminopeptidase, insert
    domain
  • a.4.5.26 mu transposase, DNA-binding domain
  • a.4.5.27 TnsA endonuclease, C-terminal domain
  • a.4.5.28 MarR-like transcriptional regulators
  • a.4.5.29 Plant O-methyltransferase,
    N-terminal domain
  • a.4.5.3 Arginine repressor (ArgR),
    N-terminal DNA-binding domain
  • a.4.5.30 C-terminal domain of the rap74
    subunit of TFIIF
  • a.4.5.31 DEP domain
  • a.4.5.32 Lrp/AsnC-like transcriptional
    regulator N-terminal domain
  • a.4.5.33 Thanscriptional regulator IclR,
    N-terminal domain
  • a.4.5.34 SCF ubiquitin ligase complex WHB
    domain
  • a.4.5.35 C-terminal fragment of elongation
    factor SelB
  • a.4.5.4 CAP C-terminal domain-like
  • a.4.5.5 ArsR-like transcriptional regulators
  • a.4.5.6 GntR-like transcriptional regulators

74
A Little Science
  • Which families does the DNA binding-domain
    superfamily have?
  • Lets find example pdbs
  • mysqlgt SELECT DISTINCT pdb_id FROM cla WHERE
    sccs"a.4.5.1"
  • --------
  • pdb_id
  • --------
  • 1bia
  • 1hxd
  • 1bib
  • 1j5y
  • --------
  • mysqlgt SELECT DISTINCT pdb_id FROM cla WHERE
    sccs"a.4.5.2"
  • --------
  • pdb_id
  • --------
  • 1jhf
  • 1jhh

75
A Little Science
1bia
1jhf
76
A Little Science
1cgp a.4.5.4
77
A Little Science
  • Some more

1smt, a.4.5.5
1b9n, a.4.5.8
1f4k, which family?
1f4k, a.4.5.7
1hw1, a.4.5.6
1bm9, a.4.5.7
78
A Little Science
  • How many percent of superfamilies have only 1
    family, how many 2, ?

79
A Little Science
  • How many percent of superfamilies have only
    1,2,3, families? First lets deposit the result
    of the query that found the number of families
    for each superfamily in a table
  • CREATE TABLE fa_freq AS
  • SELECT des.sccs, des.description,
    COUNT(DISTINCT cla.fa) AS num
  • FROM des, cla WHERE des.idcla.sf
  • GROUP BY cla.sf
  • ORDER BY num desc
  • Now we count how many superfamilies have 1,2,3,
    families
  • SELECT num AS fa_per_sf, COUNT() AS freq
  • FROM fa_freq
  • GROUP BY num

80
A Little Science
How many percent of superfamilies have only
1,2,3, families?
----------------- fa_per_sf perc
----------------- 1 0.75
2 0.13 3 0.05 4
0.02 5 0.02 6 0.01
7 0.00 8 0.00
9 0.00 10 0.00 11
0.00 12 0.00 13 0.00
14 0.00 15 0.00
18 0.00 19 0.00 20
0.00 23 0.00 35 0.00
-----------------
  • -----------------
  • fa_per_sf freq
  • -----------------
  • 1 981
  • 2 164
  • 3 65
  • 4 29
  • 5 25
  • 6 14
  • 7 6
  • 8 5
  • 9 1
  • 10 3
  • 11 1
  • 12 2
  • 13 1
  • 14 2
  • 15 1
  • 18 1

SELECT COUNT() FROM fa_freq ----------
count() ---------- 1305 ----------
SELECT num AS fa_per_sf, (COUNT()/1305)
AS perc FROM fa_freq GROUP BY num
81
A Little Science
  • How many percent of superfamilies have only
    1,2,3, families?
  • This is interesting! For the majority of
    superfamilies there is only one family!
  • What is the PDB structure with the largest number
    of (distinct) superfamilies?

82
A Little Science
  • What is the PDB structure with the largest number
    of (distinct) superfamilies?
  • SELECT pdb_id,
  • COUNT(sf) AS sf_num
  • FROM cla
  • GROUP BY pdb_id
  • ORDER BY sf_num DESC
  • LIMIT 10
  • ----------------
  • pdb_id sf_num
  • ----------------
  • 1aon 49
  • 1hto 48
  • 1ir2 48
  • 1htq 48
  • 1der 42
  • 1f49 40
  • 1jyy 40
  • 1gho 40
  • 1jyz 40
  • 1jz1 40

SELECT pdb_id, COUNT(DISTINCT sf)
AS distinct_sf_num FROM cla GROUP BY pdb_id
ORDER BY distinct_sf_num DESC LIMIT
10 ------------------------- pdb_id
distinct_sf_num -------------------------
1m1k 23 1k9m
23 1kd1 23 1kqs
23 1jj2 23 1k8a
23 1ffk 22
1i96 21 1hnz
20 1hr0 20
-------------------------
83
A structure with 23 different superfamilies
  • 1k9m Co Crystal Structure Of Tylosin Bound To The
    50S Ribosomal Subunit Of Haloarcula
    MarismortuiRibosome

84
A Little Science
  • Now lets plot how many PDBs have 1, 2, 3,
    distinct superfamilies
  • First of all let us put the result of the
    previous slide in a table (note if the table
    already exists we have to erase it first DROP
    TABLE pdb_sf_num. But be careful using DROP
  • CREATE TABLE pdb_sf_num AS
  • SELECT pdb_id,
  • COUNT(DISTINCT sf) AS distinct_sf_num
  • FROM cla
  • GROUP BY pdb_id
  • ORDER BY distinct_sf_num DESC
  • Now let us count how many PDBs with 1,2,3,
    distinct superfamilies

85
A Little Science
  • How many PDBs have 1, 2, 3, distinct
    superfamilies
  • SELECT distinct_sf_num, COUNT(pdb_id) AS num
  • FROM pdb_sf_num
  • GROUP BY distinct_sf_num
  • ORDER BY distinct_sf_num
  • ------------------------
  • distinct_sf_num num
  • ------------------------
  • 1 13960
  • 2 2721
  • 3 495
  • 4 178
  • 5 33
  • 6 25
  • 7 1
  • 9 4
  • 20 9
  • 21 1
  • 22 1

86
A Little Science
  • Lets do the same in percent
  • SELECT COUNT(DISTINCT pdb_id)
  • FROM cla
  • ----------
  • count()
  • ----------
  • 17434
  • ----------
  • There are 17434 PDB IDs

87
A Little Science
  • How many PDBs have 1, 2, 3, distinct
    superfamilies
  • SELECT distinct_sf_num, COUNT(pdb_id)/17434 AS
    perc
  • FROM pdb_sf_num
  • GROUP BY distinct_sf_num
  • ORDER BY distinct_sf_num
  • -----------------------
  • distinct_sf_num perc
  • -----------------------
  • 1 0.80
  • 2 0.16
  • 3 0.03
  • 4 0.01
  • 5 0.00
  • 6 0.00
  • 7 0.00
  • 9 0.00
  • 20 0.00
  • 21 0.00
  • 22 0.00

88
A Little Science
  • What are the most popular superfamilies?
  • I.e. for which are there the most PDB entries

89
A Little Science
  • What are the most popular superfamilies?
  • SELECT des.sccs, des.description,
  • COUNT(DISTINCT cla.pdb_id) AS
    num_of_pdb_ids
  • FROM cla,des
  • WHERE des.idcla.sf
  • GROUP BY cla.sf
  • ORDER BY num_of_pdb_ids DESC
  • LIMIT 10
  • ------------------------------------------------
    ------------------------------
  • sccs description
    num_of_pdb_ids
  • ------------------------------------------------
    ------------------------------
  • b.1.1 Immunoglobulin
    823
  • d.2.1 Lysozyme-like
    777
  • b.47.1 Trypsin-like serine proteases
    649
  • c.37.1 P-loop containing nucleotide
    triphosphate hydrolases 521
  • c.2.1 NAD(P)-binding Rossmann-fold domains
    384
  • a.1.1 Globin-like
    384
  • c.1.8 (Trans)glycosidases
    332
  • b.50.1 Acid proteases
    288

90
A Little Science
  • Are all superfamilies equally likely to co-occur?
  • Let us generate a co-occurrence map as an answer
  • Which superfamilies co-occur most frequently
  • Which superfamilies have the most co-occurrence
    partners
  • The co-occurrence map should consist of two
    tables
  • A table with PDB ID, superfamily 1, superfamily 2
    (to avoid repetition we will require that sf1 is
    alphabetically before sf2)
  • A table with superfamily 1 and 2 and the number
    of PDBs containing this co-occurrence

91
A Little Science
  • Co-occurrence map
  • SELECT DISTINCT c1.pdb_id, c1.sf, c2.sf
  • FROM cla AS c1, cla AS c2
  • WHERE c1.pdb_idc2.pdb_id AND c1.sfltc2.sf
  • LIMIT 10
  • ----------------------
  • pdb_id sf sf
  • ----------------------
  • 1cqx 46458 63380
  • 1cqx 46458 52343
  • 1gvh 46458 63380
  • 1gvh 46458 52343
  • 1b33 46458 54580
  • 1qgw 46458 56568
  • 1kf6 46548 46977
  • 1kf6 46548 51905
  • 1kf6 46548 54292
  • 1kf6 46548 56425
  • ----------------------

92
A Little Science
  • Co-occurrence Map
  • CREATE TABLE cooc AS
  • SELECT DISTINCT c1.pdb_id,
  • c1.sf AS sf1, d1.description AS sf1name,
  • c2.sf AS sf2, d2.description AS sf2name
  • FROM cla AS c1, cla AS c2, des AS d1, des AS d2
  • WHERE c1.pdb_idc2.pdb_id AND c1.sfltc2.sf AND
  • c1.sfd1.id AND c2.sfd2.id
  • ----------------------------------------------
    -------------------------------------------------
    ---------------
  • pdb_id sf1 sf1name sf2
    sf2name
  • ----------------------------------------------
    -------------------------------------------------
    ---------------
  • 1cqx 46458 Globin-like
    63380 Riboflavin synthase domain-like
  • 1cqx 46458 Globin-like
    52343 Ferredoxin reductase-like, C-terminal
    NADP-linked domain
  • 1gvh 46458 Globin-like
    63380 Riboflavin synthase domain-like
  • 1gvh 46458 Globin-like
    52343 Ferredoxin reductase-like, C-terminal
    NADP-linked domain
  • 1b33 46458 Globin-like
    54580 Allophycocyanin linker chain (domain)
  • 1qgw 46458 Globin-like
    56568 Non-globular alphabeta subunits of
    globular proteins
  • 1kf6 46548 alpha-helical ferredoxin
    46977 Succinate dehydrogenase/fumarate
    reductase C-terminal domain

93
A Little Science
  • Number of instances in co-occurrence map
  • SELECT COUNT(DISTINCT pdb_id) AS num,
  • sf1, sf1name, sf2, sf2name
  • FROM cooc
  • GROUP BY sf1,sf2
  • ORDER BY num DESC
  • LIMIT 10
  • -----------------------------------------------
    ------------------------------------------------
    -------------------------------------
  • num sf1 sf1name
    sf2 sf2name
  • -----------------------------------------------
    ------------------------------------------------
    -------------------------------------
  • 137 48726 Immunoglobulin
    54452 MHC antigen-recognition
    domain
  • 125 51011 alpha-Amylases, C-terminal
    beta-sheet domain 51445 (Trans)glycosidases
  • 117 47616 Glutathione S-transferases,
    C-terminal domain 52833 Thioredoxin-like
  • 99 47802 DNA polymerase beta, N-terminal
    domain-like 56699 Nucleotidyltransferases
  • 97 53098 Ribonuclease H-like
    56672 DNA/RNApolymerases
  • 74 51735 NAD(P)-binding Rossmann-fold
    domains 55347 Gly
Write a Comment
User Comments (0)
About PowerShow.com