Title: Database Design, Implementation, and Programming
1RiceCAP workshop, June 4 9, 2006 Noble
Foundation Ardmore, Oklahoma
Basic data relations for markers, traits, and
genotypes
Clare Nelson Dept. of Plant Pathology, Kansas
State University
2Plan
- Databases what and why?
- What's a relational database?
- Designing a database for Student data
- Identifying the main data elements
- Identifying their relations
- Creating, populating, and querying a MySQL
database - Lab Do it yourself (with guidance!)
3What you'll know after lecture and lab
- What are data relations?
- What is a relational database (RDB)?
- What's a schema?
- and how do we design one?
- How do we get data into and out of a RDB?
- What's SQL and how do we use it?
- And a free bonus tip ? ? ? ?
4How to use your SQL knowledge in the real world
5What is a database?
- (Abstract) A structure for storage and retrieval
of data - (Concrete) An instance of such a structure,
containing data - Is an Excel spreadsheet a database?
- Is a WWW page a database?
- What about a telephone directory?
6Is a telephone directory a database?
- Yes, in a limited way.
- It is designed for just one query
- Find the address and telephone number of person
X. - What if we want
- the names of all residents in the 1700 block of
Main St. - the address with telephone number 222-2222
7For our purposes, what is a DB?
- A software structure for data that explicitly
describes all the relations between the data
types. - We'll call this a relational database or RDB.
- What's a data type? What's a relation?
- We'll see by example.
8What software manages a RDB?
- Microsoft Access, Oracle, Sybase, MySQL,
PostgreSQL, DB2, Informix... - are database management systems (DBMSs).
- They are not databases
- A DBMS is software used to create, populate,
destroy, query, and administer DBs.
9Creating a simple database
- What data do I have (in Excel, say)?
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
10Let's build a MySQL database
- The MySQL server is running on my PC
- At the command line, type
- mysql -u ltusernamegt -p
- enter password
- I'm now running the MySql DBMS.
- show databases
- create database ltXXXgt
- use ltXXXgt
- show tables
11Table
- The unit of data organization in a RDB
- Describes a data type or entity
- Each column describes an attribute
- Sample table (a very bad one!)
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
12Creating a table
- CREATE TABLE STUDENT
- ( ID INT(4) NOT NULL,
- Name VARCHAR(10),
- Major VARCHAR(5),
- Nationality VARCHAR(10),
- Advisor_lname VARCHAR(10),
- Advisor_fname VARCHAR(10),
- Advisor_dept VARCHAR(10)
- )
13Viewing the table structure
---------------------------------------------
--------------- Field Type
Null Key Default Extra
--------------------------------------------
---------------- ID int(4)
NO Name
varchar(10) YES
Major varchar(5) YES
Nationality
varchar(10) YES
Advisor_lname varchar(10) YES
Advisor_fname
varchar(10) YES
Advisor_dept varchar(10) YES
--------------------------------
----------------------------
14Populating the STUDENT table
- INSERT INTO STUDENT VALUES (0, 'Guo', 'GEN',
'China', 'Nelson', 'Clare', 'Plant Path') - INSERT INTO STUDENT VALUES (0, 'Guo', 'GEN',
'China', 'Jannink', 'Jean-Luc', 'Agronomy') - INSERT INTO STUDENT VALUES (1, 'Mosquera',
'PLPTH', 'Colombia', 'Valent', 'Barbara', 'Plant
Path') - INSERT INTO STUDENT VALUES (2, 'Smith', 'STAT',
'USA', 'Nelson', 'Lloyd', 'Statistics') - INSERT INTO STUDENT VALUES (3, 'Johnson', 'STAT',
'USA', 'Evert', 'Sam', 'Philosophy') - INSERT INTO STUDENT VALUES (3, 'Johnson', 'STAT',
'USA', 'Nelson', 'Lloyd', 'Statistics')
15Querying the database
- SQL (Structured Query Language)
- is how we
- insert
- query
- delete
- alter
- sort
- calculate
- and do all other data operations
16Use SELECT for querying
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
17A query with conditions
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
Show names and majors of students whose
nationality is Chinese or whose major is STAT.
Name, Major STUDENT Nationality 'China' Major
'STAT'
18Is this a "good" database?
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
- Efficiently maintained and queried?
- data easily added, dropped, changed?
- If you correct or update repeated data, how can
you be sure you've found every instance of the
bad data? - With multiple records for a student, how can you
be sure you've found the information you want?
19To reduce redundancy pack multiple data into
fields?
---------------------------------------------
----------------------------------------- ID
Name Major Nationality Advisor_lname
Advisor_fname Advisor_dept
-------------------------------------------
-------------------------------------------
0 Guo GEN China Nelson,
Jannink Clare, Jean-Luc Plant Path, Agronomy
3 Smith STAT USA Nelson
Lloyd Statistics
etc. etc. ---------------------------------
---------------------------------------------
- Why is this a bad idea?
- It's hard to get at the attributes of the packed
data - Adding/removing data requires updating old data
20Why not just add attributes?
- Must know number of advisors in advance of
designing the database - Can't easily find all advisors of student X
- Advisor data what's it doing in the same table
with student, anyway?
21Uncoupling unrelated data
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 1 Guo GEN China
Jannink Jean-Luc Agronomy
2 Mosquera PLPTH Colombia Valent
Barbara Plant Path 3 Smith
STAT USA Nelson Lloyd
Statistics 4 Johnson STAT
USA Evert Sam
Philosophy 5 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
- It's reasonable to separate advisor from student!
- if we want to change or add an attribute to
advisor, why should we have to update student
records? - How can we reduce unwanted dependencies?
22Relational database idea
- Developed by E. F. Codd in 1970
- Records in tables are linked by relations between
entities - The structure of an RDB is called the data model
or schema
23Thinking about relations
- Can a student have gt 1 advisor?
- an advisor have gt 1 student?
- Redundancy comes from data types that don't have
1-to-1 relations! - Can we use relations to develop rules for
designing a schema? - Let's list the possible relations....
24What is a data relation?
- A rule of association. There are only three
- One to one (11)
- One to many ( many to one) (1M, M1)
- Many to many (MM)
- Of what type are the following relations?
- Mother, child
- Student, faculty advisor
- House, street address
- Food prices, weather forecast
- genetic marker, genome map
1M MM 11 ?? MM
25Designing a schema
- Decide on the main data entities and their
attributes - Determine their relations
- Create a table for each entity
- Add columns for attributes
- Add columns that link tables according to the
data relations - Add tables to reduce redundancy
26What are the main entities in the Student data?
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
- A "main entity" one with attributes of its own)
- Student
- with ID, name, major, nationality, advisors
- Advisor
- with first name, last name, department
27First table-creation decisions
---------------------------------------------
--------------------------------- ID Name
Major Nationality Advisor_lname
Advisor_fname Advisor_dept ----------------
-----------------------------------------------
--------------- 0 Guo GEN China
Nelson Clare Plant
Path 0 Guo GEN China
Jannink Jean-Luc Agronomy
1 Mosquera PLPTH Colombia Valent
Barbara Plant Path 2 Smith
STAT USA Nelson Lloyd
Statistics 3 Johnson STAT
USA Evert Sam
Philosophy 3 Johnson STAT USA
Nelson Lloyd Statistics
-------------------------------------------
-----------------------------------
- Suppose we start by creating STUDENT and ADVISOR
tables.
28Create a new STUDENT table
- drop table STUDENT
- CREATE TABLE STUDENT
- ( ID INT(4) NOT NULL,
- Name VARCHAR(10),
- Major VARCHAR(5),
- Nationality VARCHAR(10),
- )
- Advisor_lname VARCHAR(10),
- Advisor_fname VARCHAR(10),
- Advisor_dept VARCHAR(10)
- (leave these out for now .... why?)
29Repopulate STUDENT table
- INSERT INTO STUDENT VALUES (0, 'Guo', 'GEN',
'China') - INSERT INTO STUDENT VALUES (1, 'Mosquera',
'PLPTH', 'Colombia') - INSERT INTO STUDENT VALUES (2, 'Smith', 'STAT',
'USA') - INSERT INTO STUDENT VALUES (3, 'Johnson', 'STAT',
'USA')
select from STUDENT ---------------------
--------- ID Name Major Country
------------------------------ 0 Guo
GEN China 1 Mosquera PLPTH
Colombia 2 Smith STAT USA 3
Johnson STAT USA ---------------
---------------
30Table ADVISOR
- CREATE TABLE ADVISOR
- ( ID INT(4) NOT NULL,
- LName VARCHAR(10),
- FName VARCHAR(5),
- Department VARCHAR(20)
- )
- Why do we always put an ID field in these tables?
31Populating the ADVISOR table
- INSERT INTO ADVISOR VALUES (0, 'Nelson', 'James',
'Plant Pathology') - INSERT INTO ADVISOR VALUES (1, 'Valent',
'Barbara', 'Plant Pathology') - INSERT INTO ADVISOR VALUES (2, 'Nelson', 'Lloyd',
'Statistics') - INSERT INTO ADVISOR VALUES (3, 'Evert', 'Sam',
'Philosophy') - INSERT INTO ADVISOR VALUES (4, 'Jannink',
'Jean-Luc', 'Agronomy') - select from ADVISOR
------------------------------------- ID
LName FName Department
------------------------------------- 0
Nelson James Plant Pathology 1
Valent Barba Plant Pathology 2 Nelson
Lloyd Statistics 3 Evert Sam
Philosophy 4 Jannink Jean-
Agronomy ----------------------------
---------
32Making trouble with SQL
- Also try to add duplicate data
- INSERT INTO ADVISOR VALUES (0, 'Nelson', 'James',
'Plant Pathology') - INSERT INTO ADVISOR VALUES (1, 'Valent',
'Barbara', 'Plant Pathology') - What happened?
- MySQL didn't prevent us from ruining the
nonredundant design... - We'll see how to do this later.
33How do we link students with advisors?
- We would like to store the advisors of each
student and the students of each advisor... - But with a MM relation, we can't store these data
in either table.... - (Why not?)
- What do we do?
34First, an easier problem 1M and M1 relations
- Suppose we decide to place Nationality in its own
table. - (Why might we want to do this?
- CREATE TABLE NATIONALITY
- ( ID INT(4) NOT NULL,
- Country VARCHAR(25)
- )
- INSERT INTO NATIONALITY VALUES (0, 'China')
- INSERT INTO NATIONALITY VALUES (1, 'Colombia')
- INSERT INTO NATIONALITY VALUES (2, 'USA')
------------ ID Country
-------------- 0 China 1
Colombia 2 USA --------------
35Rebuild the STUDENT table...
- drop table STUDENT
- ... and replace Nationality with a numerical key!
- CREATE TABLE STUDENT
- ( ID INT(4) NOT NULL,
- Name VARCHAR(10),
- Major VARCHAR(5),
- Nat_ID INT(4)
- )
36Repopulate the STUDENT table
- INSERT INTO STUDENT VALUES (0, 'Guo', 'GEN', 0)
- INSERT INTO STUDENT VALUES (1, 'Mosquera',
'PLPTH', 1) - INSERT INTO STUDENT VALUES (2, 'Smith', 'STAT',
2) - INSERT INTO STUDENT VALUES (3, 'Johnson', 'STAT',
2) - select from STUDENT
----------------------------- ID Name
Major Nat_ID --------------------------
--- 0 Guo GEN 0 1
Mosquera PLPTH 1 2 Smith
STAT 2 3 Johnson STAT 2
-----------------------------
37Now how do we find the nationalities of students?
NATIONALITY table ------------ ID
Country -------------- 0 China
1 Colombia 2 USA
--------------
STUDENT table -----------------------------
ID Name Major Nat_ID
----------------------------- 0 Guo
GEN 0 1 Mosquera PLPTH 1
2 Smith STAT 2 3
Johnson STAT 2 -------------------
----------
Name, Country STUDENT, NATIONALITY (is this
enough?) STUDENT.Nat_ID NATIONALITY.ID
-------------------- Name Country
-------------------- Guo China
Mosquera Colombia Smith USA
Johnson USA --------------------
This is called a join operation.
38The join operation and keys
- Because storing 1M data types in one table breaks
our One Data Instance gt One Record rule, we - place the two data types into two tables
- specify their relations with primary and foreign
keys - use a join query to extract attributes from both
tables at once
39Primary and foreign keys
- Primary key unique identifier (generally a
number) for each record in a table. - Foreign key an identifier in one table that
matches the primary key of another table - It serves to join the two tables.
- It's the address of a record, rather than a copy
of the data in the record. - This decouples the attributes of two different
data types.
40Primary foreign keys example
A primary key
A foreign key
STUDENT table -----------------------------
ID Name Major Nat_ID
----------------------------- 0 Guo
GEN 0 1 Mosquera PLPTH 1
2 Smith STAT 2 3
Johnson STAT 2 -------------------
----------
NATIONALITY table ------------ ID
Country -------------- 0 China
1 Colombia 2 USA
--------------
41Many-to-many relations
- We handled the 1M relation
- 1 nationality ltgt many students.
- How do we handle the MM relation
- many students ltgt many advisors?
42Table-design rules
- 11 relations data types may be placed in the
same table - 1M or M1 create two tables, with primary and
foreign keys - MM create three tables, with two primary keys
and a composite foreign key.
43Handling the student advisor relation
- CREATE TABLE STUDENT
- ( ID INT(4) NOT NULL,
- Name VARCHAR(10),
- Major VARCHAR(5),
- Nat_ID INT(4),
- Primary key (ID),
- Foreign key (Nat_ID) references NATIONALITY(ID)
- )
- CREATE TABLE ADVISOR
- ( ID INT(4) NOT NULL,
- LName VARCHAR(10),
- FName VARCHAR(5),
- Department VARCHAR(20),
- Primary key (ID)
- )
CREATE TABLE STUDENT_ADVISOR ( Student_ID INT(4)
NOT NULL, Advisor_ID INT(4) NOT NULL, Is_major
CHAR(1), primary key (Student_ID,
Advisor_ID), foreign key (Student_ID) references
STUDENT(ID), foreign key (Advisor_ID) references
ADVISOR(ID) )
Notice that we are telling MySQL about the
keys. What will MySQL do with this?
44Populating the keyed tables
- INSERT INTO STUDENT VALUES (0, 'Guo', 'GEN', 0)
- INSERT INTO STUDENT VALUES (1, 'Mosquera',
'PLPTH', 1) - INSERT INTO STUDENT VALUES (2, 'Smith', 'STAT',
2) - INSERT INTO STUDENT VALUES (3, 'Johnson', 'STAT',
2) - INSERT INTO ADVISOR VALUES (0, 'Nelson', 'James',
'Plant Pathology') - INSERT INTO ADVISOR VALUES (1, 'Valent',
'Barbara', 'Plant Pathology') - INSERT INTO ADVISOR VALUES (2, 'Nelson', 'Lloyd',
'Statistics') - INSERT INTO ADVISOR VALUES (3, 'Evert', 'Sam',
'Philosophy') - INSERT INTO ADVISOR VALUES (4, 'Jannink',
'Jean-Luc', 'Agronomy') - Try repeating the last command. What happens?
45Populating the join table
INSERT INTO STUDENT_ADVISOR VALUES (0, 0,
1) INSERT INTO STUDENT_ADVISOR VALUES (0, 4,
0) INSERT INTO STUDENT_ADVISOR VALUES (1, 1,
1) INSERT INTO STUDENT_ADVISOR VALUES (2, 2,
1) INSERT INTO STUDENT_ADVISOR VALUES (3, 3,
0) INSERT INTO STUDENT_ADVISOR VALUES (3, 2, 1)
select from STUDENT_ADVISOR -----------------
----------------- Student_ID Advisor_ID
Is_major ----------------------------------
0 0 1 0
4 0 1 1
1 2 2 1
3 3 0 3
2 1
----------------------------------
46Tables set up for MM relation
describe STUDENT ---------------------------
----- Field Type Null Key
-------------------------------- ID
int(4) NO PRI Name
varchar(10) YES Major varchar(5)
YES Nat_ID int(4) YES
MUL --------------------------------
- describe STUDENT_ADVISOR
- --------------------------------
- Field Type Null Key
- --------------------------------
- Student_ID int(4) NO PRI
- Advisor_ID int(4) NO PRI
- Is_major char(1) YES
- --------------------------------
describe ADVISOR ----------------------------
-------- Field Type Null
Key ------------------------------------
ID int(4) NO PRI LName
varchar(10) YES FName
varchar(5) YES Department
varchar(20) YES ---------------------
---------------
How would you extract the names of all advisors
of student named Guo?
47Who are all Guo's advisers?
- SELECT
- FROM
- WHERE
- AND
- AND
Name, LName, FName STUDENT, ADVISOR,
STUDENT_ADVISOR STUDENT.Name 'Guo' STUDENT.ID
STUDENT_ADVISOR.Student_ID ADVISOR.ID
STUDENT_ADVISOR.Advisor_ID
48A joining query
How to select all the students who have James
Nelson as an advisor?
- SELECT
- FROM
- WHERE
- AND
- AND
- AND
Name, LName STUDENT, ADVISOR, STUDENT_ADVISOR LNam
e "Nelson" FName "Lloyd " ADVISOR_ID
ADVISOR.ID STUDENT_ID STUDENT.ID
49Summary
- RDBs incorporate 11, 1M/M1, and MM relations
among data types - Data are stored economically
- with each table describing one data "entity" and
its attributes - and key fields encoding the relations between
entities - Data are extracted with SQL SELECT statements
- and data from different tables are extracted with
joins
50Categories of SQL statement
- Data definition (DDL)
- CREATE
- ALTER
- DROP
- RENAME
- TRUNCATE
- Data manipulation (DML)
- INSERT
- UPDATE
- DELETE
- Data control (DCL)
- GRANT
- REVOKE
- Data retrieval (DQL)
- SELECT
- Transaction control (TC)
- COMMIT
- ROLLBACK
51Elements of SQL statements
- A verb, such as SELECT (Do what?)
- A predicate object that specifies field names in
tables ( means all fields) (To what?) - A prepositional clause describing the table on
which the verb acts (FROM tablename) (Where?) - A conditional or adverbial clause (WHERE Advisor
"Nelson") (How?) - Various useful helping verbs such as ORDER BY
52How MySQL stores numbers
- INTEGER A whole number
- VARCHAR(10) Up to 10 characters
- CHAR(10) Fixed number of characters
- DATE A date
- DATETIME Date and time
- FLOAT Floating-point number
- TEXT allows up to 65535 characters
- DECIMAL(10, 2) Up to 10 digits before the point,
2 after - Note Oracle uses different number types. ALL
DBMSs are not interchangeable!
53Customer order systementity relationship
diagram