Title: Information Infrastructure II
1Information Infrastructure II
- I211 Week 11
- Rajarshi Guha
2Outline
- Why use a database
- Terminology
- Basic SQL
- Working with Postgres
- The Python DB-API
3Storing Data
- Data comes from many places
- Medical records
- Credit card information
- Molecular structures
- Gene sequences
- Inventories
- Sizes of data sets can vary
- lt 50 Kb (30 student records)
- gt 9GB (ATTs switch information for 1 day)
4Storing Data
- Depends on size and need
- Small datasets could be handled using Notepad
- List of attendees for a party
- Bigger datasets could be handled in Excel
- Class grades
- Personal finance information
5Storing Data
- For very large datasets we need specialized
software - Database
- But databases can be used even for small datasets
- Dont use Excel as a substitute for a database!
6Types of Databases
- Flatfile
- Essentially an on-disk dictionary
- e.g. BerkeleyDB
- Relational
- What well focus on
- e.g. PostgreSQL, MySQL, Access, Oracle
- XML
- e.g. Xindice, Sedna
- Object oriented
7What Does a Database Do?
- Efficiently store different types of data
- Maintain integrity of your data
- Efficiently extract data based on certain
conditions - Databases are not meant to do
- Complex calculations
- Presentation of results (i.e. interfaces)
8Types
- Data can be of different types
- Dates
- Names
- Age
- Price of a product
- JPEG image
9Types
- Depending on the type you can do different things
- Dates - find the number of days since the
patients last visit - Names - find a person whose name starts with S
- Age - find a person whose age is gt 20
- Price - find all products which cost lt 20
10Types
- Depending on the type, there are some things that
dont make sense - An age cant be less than 0
- A price cant be less than 0
- A name cant be an empty string
- A date cant be before the year 1900
- Note, some of the conditions are valid in an
absolute sense - But depending on your problem they may not be
sensible - Does a credit card application care about the
year 1900?
11Data Integrity
- The DB should not allow you to mess up your data
- The data should always make sense
- Examples
- A DB for patients, should not repeat the details
of a given patient - If you delete a patient, then all data related to
that patient should be deleted - Should not be able to set a library due date
before todays date
12Data Extraction
- This is the rationale for databases
- Store large amounts of data
- Query the data to get what you want
- Find all patients admitted since last Thursday
- Find all students who are taking I211 andwho
have also taken CS398 andwho have done an
internship - Find all stocks that have a market cap of X
andwhose volume was equal to the max volume last
month or whose PE ratio dipped below the min PE
ratio last year
13Data Extraction
- Conceptually not difficult to do
- You can already do this with a text or XML file
and a bit of Python code - But I have 1,000,000 records and I want my answer
in under 5 sec - Ordinary Python programs wont be able to do that
- Takes more than 5 sec just to read a million
lines! - Time to move to the database
14Relational Databases
- Different pieces and types of data can be related
- A Person is related to Address
- For a Person, there will be many instances -
i.e., individual people - Similarly for addresses
15Terminology
- A relation is also known as a table and
represents an entity - Entity is essentially the same as a concept.
Could be - A person
- A car
- A product
- Tables will have one or more columns (fields)
- Also known as attributes
- A record is one row from the table. Also called a
tuple
16The R in RDBMS
- Generally, youll have multiple tables
- Tables will have relationship between them
- Choosing how many tables should be used and what
goes into them is part of database design - Quite a complex topic
- Well established rules for good designs
17The R in RDBMS - Medical Records
- Physician - id number, name, etc
- Mother - id number, weight etc
- Baby - id number, date delivered, wt
http//audilab.bmed.mcgill.ca/funnell/Bacon/DBMS/
dbms.html
18The R in RDBMS - Medical Records
- Physician related to Mother
- A mother will be attended by one physician
- Mother related to Baby
- A given baby has one mother
- Physician could be related to Baby
- Depends on theproblem
http//audilab.bmed.mcgill.ca/funnell/Bacon/DBMS/
dbms.html
19Postgres
- One of the popular OSS databases
- Robust, powerful, stable
- If you write standard SQL it will work on any
modern RDBMS - Each RDBMS has some quirks
- Names of column types
- Showing tables
- Loading data
20Postgres
- Well focus on using standard SQL
- Ill provide information on working with Postgres
- You should be able to transfer the knowledge to
MySQL or SQLServer or Oracle if required
21Logging Into Postgres
- ssh to Sulu
- At the prompt psql python211 -U i211
- The name of the database is python211
- Password is I211pwd
- Note - All of you work in the same database.
Respect others!
22Creating a Table
- Provide a table name
- Decide on the attributes (columns, fields) for
the table - Decide on the types of the columns
- Write the SQL
23Creating a Table
- Dont use capitals
- Can include as many columns you want
- Dictated by the design
- Usually need to add constraints
- List the tabledefinition using\d tablename
create table mytable ( id integer, firstname
text, lastname text, age integer)
24Creating a Table
- python211gt \d mytable
- Table "public.mytable"
- Column Type Modifiers
- -------------------------------
- id integer
- firstname text
- lastname text
- age integer
See http//www.postgresql.org/docs/7.4/interactive
/datatype.html
25Dropping a Table
- If a table with the same name exists, then create
table will fail - Drop the table first using drop table tablename
26Simple Constraints
- Currently, we can enter an age of -5
- Doesnt make sense
- We could try and discipline ourselves and not put
in such values - We might make a mistake
- We should let the database enforce such
requirements - Constraints
27Simple Constraints
- Types of constraints
- Check - limits the possible values
- Not null - ensures that something cannot be null
- Default values - specifies a value if nothing is
specified - Unique - indicates that a column will have unique
values (i.e., no repeats)
See http//www.postgresql.org/docs/7.4/interactive
/ddl-constraints.htmlAEN1898
28Simple Constraints
The id column can only have unique values
- create table mytable (
- id integer unique,
- firstname text not null,
- lastname text not null,
- age integer default 10,
- check (age gt 0)
- )
Can not have a NULL value
If we dont specify an age when inserting data,
its set to 10
Can never insert a negative age
29Inserting Data
- General format insert into tablename
values (val1, val2, , valN) - You generally have to provide values for each of
the columns you defined for the table - Constraints may prevent you from inserting
certain values
30Inserting Data
- We said that age has to be a positive non-zero
number - Cant insert two rows with the same value of id
python211gt insert into mytable values (1,
'John', 'Smith', 0) ERROR new row for relation
"mytable" violates check constraint "1"
python211gt insert into mytable values (1,
'John', 'Smith', 10) INSERT 17186 1 python211gt
insert into mytable values (1, 'Jane', 'Smith',
12) ERROR duplicate key violates unique
constraint "mytable_id_key
31Inserting Data
- Since we have set default values, we dont need
to specify all the column values - But we have only specified a default value for
age, not the other columns
python211gt insert into mytable values (2,
'Jeremy', 'Fish') INSERT 17188 1 python211gt
insert into mytable values (2) ERROR null
value in column "firstname" violates not-null
constraint
32Inserting Data
- Good practice is to specify the column names for
which you are inserting data
python211gt insert into mytable (id, firstname,
lastname, age) values python211-gt (3,
'Richard', 'Block', 22) INSERT 17189 1
33Selecting Data
- Weve been putting data into the table
- How can we see what we put in?
- All of what we put in
- Some of what we put in
- One of the primary tasks in using a databases is
to select rows
34Selecting Data
- Select all the columns for all the rows from a
table select from mytable - Select age and last name for all rows
- select lastname, age from mytable
- Most common error is to misspell a column or
table name
35Selecting Certain Rows
- Usually a table will contain 1000s or millions
of rows - Selecting all rows is not helpful
- We select certain rows by using a condition
36Selecting Certain Rows
- Select all people whose age lt 10 select
firstname, lastname from mytable where age lt 10 - Select all people between 10 and 22 select
firstname, lastname from mytable where age gt 10
and age lt 22
37Selecting Certain Rows
- You can select based on text
- Select all rows with a certain first name select
from mytable where firstname Robert - Select all rows with a last name starting with
S select from mytable where lastname S - This is an example of using regexes in a select
condition
38Deleting Rows
- Youve already seen how to delet a whole
table drop table mytable - To delete all rows from a table delete from
mytable - Be very careful!
- Always validate input in database applications!
- The table itself still remains, so you can add
new rows etc
39Deleting Rows
- You can delete certain rows by using conditions
- Similar to how we use select and conditions
- delete from mytable where age lt 10
- You can check whether the delete did what you
want by doing a select with the same condition - If you did do it wrong, cant get back what you
deleted!
40Updating Rows
- Sometimes you want to just change the value of
one column in a row (or multiple rows) - To do an updat you need three pieces of
information - Name of table and column
- New value of the column
- Which row(s) to update
41Updating Rows
- For anybody whose age lt 15, set their age to 74
- For anybody whose first name starts with a J,
set their age to 34and last name to Blah
update mytable set age 74 where age lt
15 update mytable set age 34, lastname
'Blah' where firstname 'J'
42More on Select
- The database is designed for fast searching
- Does require some help from you when dealing with
large datasets - So, we should try and do as much manipulation
within the database - Getting min and max
- Sorting
- Getting the top 10 results
43Sorting Results
- You sort the results of a query by a specific
column - The column should exist in the table being
queried
select from rguha_scores where age lt 30 order
by timing select from rguha_scores where age
lt 30 order by timing desc select id, age from
rguha_scores where age lt 30 order by timing
44Limiting the Results
- Sometimes well get back thousands of rows
- Cumbersome to view
- May want just the top 10 or 20
- Use limit to control how many results you need
45Limiting the Results
- You can use limit to get just a few rows
- Note that the DB does not always store the data
in the order that you input it
select from rguha_scores limit 10 select
from rguha_scores where age gt 18 and age lt 33
order by timing limit 5
46The Python DB-API
- A consistent way to access different databases
- If your program uses this, your application can
use any database supported by the API - Just need to change 2 lines
- Hides much of the database internals
- Allows you to interact in a Pythonic way
- A row is a Python tuple
47The Python DB-API
- You need to construct SQL statements as strings
- The API will allow you to
- Send a query
- Get the results of a query
- One at a time
- All together
- Commit transactions
48Committing Transactions
- When you insert data, it might take more than one
step - What happens if the power goes of between the
steps? - You have part of the data inserted
- The rest is not in the DB
- You record is not valid anymore
49Committing Transactions
- Operations should be atomic
- All the individual statements must complete
successfully for the whole thing to be successful - If one of the steps fails, the database goes back
to its original state - The collection of statements that must work all
together is called a transaction
50Committing Transactions
- When we do simple statements at the Postgres
prompt, doesnt matter too much - When writing Python code that will insert and
delete stuff, need to take care
51Connecting to Postgres
- Well use the psycopg2 module
- Remember to import it
- To connect we use the connect method
- Returns a connection object
- We then get a cursor object
- Now were set to use the database
con psycopg2.connect('dbnamepython211
useri211 passwordI211pwd') cursor con.cursor()
52Performing Queries
- The cursor object has a method call execute takes
one argument - Create an SQL statement in a string
- Send this to execute
- This function has no return value
sql select id, age from rguha_licensed where
licensed yes cursor.execute(sql)
53How To Handle Errors?
- If your SQL statement has an error an exception
will be thrown - Catch it and report what happened
- If the statement was successful, there is no
response - We check whether we got any results by fetching
them
ret cursor.fetchall()