Information Infrastructure II - PowerPoint PPT Presentation

1 / 53
About This Presentation
Title:

Information Infrastructure II

Description:

Sizes of data sets can vary 50 Kb (30 student records) ... Allows you to interact in a Pythonic way. A row is a Python tuple. The Python DB-API ... – PowerPoint PPT presentation

Number of Views:47
Avg rating:3.0/5.0
Slides: 54
Provided by: rajars
Category:

less

Transcript and Presenter's Notes

Title: Information Infrastructure II


1
Information Infrastructure II
  • I211 Week 11
  • Rajarshi Guha

2
Outline
  • Why use a database
  • Terminology
  • Basic SQL
  • Working with Postgres
  • The Python DB-API

3
Storing Data
  • Data comes from many places
  • Medical records
  • Credit card information
  • Molecular structures
  • Gene sequences
  • Inventories
  • Sizes of data sets can vary
  • lt 50 Kb (30 student records)
  • gt 9GB (ATTs switch information for 1 day)

4
Storing Data
  • Depends on size and need
  • Small datasets could be handled using Notepad
  • List of attendees for a party
  • Bigger datasets could be handled in Excel
  • Class grades
  • Personal finance information

5
Storing Data
  • For very large datasets we need specialized
    software
  • Database
  • But databases can be used even for small datasets
  • Dont use Excel as a substitute for a database!

6
Types of Databases
  • Flatfile
  • Essentially an on-disk dictionary
  • e.g. BerkeleyDB
  • Relational
  • What well focus on
  • e.g. PostgreSQL, MySQL, Access, Oracle
  • XML
  • e.g. Xindice, Sedna
  • Object oriented

7
What Does a Database Do?
  • Efficiently store different types of data
  • Maintain integrity of your data
  • Efficiently extract data based on certain
    conditions
  • Databases are not meant to do
  • Complex calculations
  • Presentation of results (i.e. interfaces)

8
Types
  • Data can be of different types
  • Dates
  • Names
  • Age
  • Price of a product
  • JPEG image

9
Types
  • Depending on the type you can do different things
  • Dates - find the number of days since the
    patients last visit
  • Names - find a person whose name starts with S
  • Age - find a person whose age is gt 20
  • Price - find all products which cost lt 20

10
Types
  • Depending on the type, there are some things that
    dont make sense
  • An age cant be less than 0
  • A price cant be less than 0
  • A name cant be an empty string
  • A date cant be before the year 1900
  • Note, some of the conditions are valid in an
    absolute sense
  • But depending on your problem they may not be
    sensible
  • Does a credit card application care about the
    year 1900?

11
Data Integrity
  • The DB should not allow you to mess up your data
  • The data should always make sense
  • Examples
  • A DB for patients, should not repeat the details
    of a given patient
  • If you delete a patient, then all data related to
    that patient should be deleted
  • Should not be able to set a library due date
    before todays date

12
Data Extraction
  • This is the rationale for databases
  • Store large amounts of data
  • Query the data to get what you want
  • Find all patients admitted since last Thursday
  • Find all students who are taking I211 andwho
    have also taken CS398 andwho have done an
    internship
  • Find all stocks that have a market cap of X
    andwhose volume was equal to the max volume last
    month or whose PE ratio dipped below the min PE
    ratio last year

13
Data Extraction
  • Conceptually not difficult to do
  • You can already do this with a text or XML file
    and a bit of Python code
  • But I have 1,000,000 records and I want my answer
    in under 5 sec
  • Ordinary Python programs wont be able to do that
  • Takes more than 5 sec just to read a million
    lines!
  • Time to move to the database

14
Relational Databases
  • Different pieces and types of data can be related
  • A Person is related to Address
  • For a Person, there will be many instances -
    i.e., individual people
  • Similarly for addresses

15
Terminology
  • A relation is also known as a table and
    represents an entity
  • Entity is essentially the same as a concept.
    Could be
  • A person
  • A car
  • A product
  • Tables will have one or more columns (fields)
  • Also known as attributes
  • A record is one row from the table. Also called a
    tuple

16
The R in RDBMS
  • Generally, youll have multiple tables
  • Tables will have relationship between them
  • Choosing how many tables should be used and what
    goes into them is part of database design
  • Quite a complex topic
  • Well established rules for good designs

17
The R in RDBMS - Medical Records
  • Physician - id number, name, etc
  • Mother - id number, weight etc
  • Baby - id number, date delivered, wt

http//audilab.bmed.mcgill.ca/funnell/Bacon/DBMS/
dbms.html
18
The R in RDBMS - Medical Records
  • Physician related to Mother
  • A mother will be attended by one physician
  • Mother related to Baby
  • A given baby has one mother
  • Physician could be related to Baby
  • Depends on theproblem

http//audilab.bmed.mcgill.ca/funnell/Bacon/DBMS/
dbms.html
19
Postgres
  • One of the popular OSS databases
  • Robust, powerful, stable
  • If you write standard SQL it will work on any
    modern RDBMS
  • Each RDBMS has some quirks
  • Names of column types
  • Showing tables
  • Loading data

20
Postgres
  • Well focus on using standard SQL
  • Ill provide information on working with Postgres
  • You should be able to transfer the knowledge to
    MySQL or SQLServer or Oracle if required

21
Logging Into Postgres
  • ssh to Sulu
  • At the prompt psql python211 -U i211
  • The name of the database is python211
  • Password is I211pwd
  • Note - All of you work in the same database.
    Respect others!

22
Creating a Table
  • Provide a table name
  • Decide on the attributes (columns, fields) for
    the table
  • Decide on the types of the columns
  • Write the SQL

23
Creating a Table
  • Dont use capitals
  • Can include as many columns you want
  • Dictated by the design
  • Usually need to add constraints
  • List the tabledefinition using\d tablename

create table mytable ( id integer, firstname
text, lastname text, age integer)
24
Creating a Table
  • python211gt \d mytable
  • Table "public.mytable"
  • Column Type Modifiers
  • -------------------------------
  • id integer
  • firstname text
  • lastname text
  • age integer

See http//www.postgresql.org/docs/7.4/interactive
/datatype.html
25
Dropping a Table
  • If a table with the same name exists, then create
    table will fail
  • Drop the table first using drop table tablename

26
Simple Constraints
  • Currently, we can enter an age of -5
  • Doesnt make sense
  • We could try and discipline ourselves and not put
    in such values
  • We might make a mistake
  • We should let the database enforce such
    requirements
  • Constraints

27
Simple Constraints
  • Types of constraints
  • Check - limits the possible values
  • Not null - ensures that something cannot be null
  • Default values - specifies a value if nothing is
    specified
  • Unique - indicates that a column will have unique
    values (i.e., no repeats)

See http//www.postgresql.org/docs/7.4/interactive
/ddl-constraints.htmlAEN1898
28
Simple Constraints
The id column can only have unique values
  • create table mytable (
  • id integer unique,
  • firstname text not null,
  • lastname text not null,
  • age integer default 10,
  • check (age gt 0)
  • )

Can not have a NULL value
If we dont specify an age when inserting data,
its set to 10
Can never insert a negative age
29
Inserting Data
  • General format insert into tablename
    values (val1, val2, , valN)
  • You generally have to provide values for each of
    the columns you defined for the table
  • Constraints may prevent you from inserting
    certain values

30
Inserting Data
  • We said that age has to be a positive non-zero
    number
  • Cant insert two rows with the same value of id

python211gt insert into mytable values (1,
'John', 'Smith', 0) ERROR new row for relation
"mytable" violates check constraint "1"
python211gt insert into mytable values (1,
'John', 'Smith', 10) INSERT 17186 1 python211gt
insert into mytable values (1, 'Jane', 'Smith',
12) ERROR duplicate key violates unique
constraint "mytable_id_key
31
Inserting Data
  • Since we have set default values, we dont need
    to specify all the column values
  • But we have only specified a default value for
    age, not the other columns

python211gt insert into mytable values (2,
'Jeremy', 'Fish') INSERT 17188 1 python211gt
insert into mytable values (2) ERROR null
value in column "firstname" violates not-null
constraint
32
Inserting Data
  • Good practice is to specify the column names for
    which you are inserting data

python211gt insert into mytable (id, firstname,
lastname, age) values python211-gt (3,
'Richard', 'Block', 22) INSERT 17189 1
33
Selecting Data
  • Weve been putting data into the table
  • How can we see what we put in?
  • All of what we put in
  • Some of what we put in
  • One of the primary tasks in using a databases is
    to select rows

34
Selecting Data
  • Select all the columns for all the rows from a
    table select from mytable
  • Select age and last name for all rows
  • select lastname, age from mytable
  • Most common error is to misspell a column or
    table name

35
Selecting Certain Rows
  • Usually a table will contain 1000s or millions
    of rows
  • Selecting all rows is not helpful
  • We select certain rows by using a condition

36
Selecting Certain Rows
  • Select all people whose age lt 10 select
    firstname, lastname from mytable where age lt 10
  • Select all people between 10 and 22 select
    firstname, lastname from mytable where age gt 10
    and age lt 22

37
Selecting Certain Rows
  • You can select based on text
  • Select all rows with a certain first name select
    from mytable where firstname Robert
  • Select all rows with a last name starting with
    S select from mytable where lastname S
  • This is an example of using regexes in a select
    condition

38
Deleting Rows
  • Youve already seen how to delet a whole
    table drop table mytable
  • To delete all rows from a table delete from
    mytable
  • Be very careful!
  • Always validate input in database applications!
  • The table itself still remains, so you can add
    new rows etc

39
Deleting Rows
  • You can delete certain rows by using conditions
  • Similar to how we use select and conditions
  • delete from mytable where age lt 10
  • You can check whether the delete did what you
    want by doing a select with the same condition
  • If you did do it wrong, cant get back what you
    deleted!

40
Updating Rows
  • Sometimes you want to just change the value of
    one column in a row (or multiple rows)
  • To do an updat you need three pieces of
    information
  • Name of table and column
  • New value of the column
  • Which row(s) to update

41
Updating Rows
  • For anybody whose age lt 15, set their age to 74
  • For anybody whose first name starts with a J,
    set their age to 34and last name to Blah

update mytable set age 74 where age lt
15 update mytable set age 34, lastname
'Blah' where firstname 'J'
42
More on Select
  • The database is designed for fast searching
  • Does require some help from you when dealing with
    large datasets
  • So, we should try and do as much manipulation
    within the database
  • Getting min and max
  • Sorting
  • Getting the top 10 results

43
Sorting Results
  • You sort the results of a query by a specific
    column
  • The column should exist in the table being
    queried

select from rguha_scores where age lt 30 order
by timing select from rguha_scores where age
lt 30 order by timing desc select id, age from
rguha_scores where age lt 30 order by timing
44
Limiting the Results
  • Sometimes well get back thousands of rows
  • Cumbersome to view
  • May want just the top 10 or 20
  • Use limit to control how many results you need

45
Limiting the Results
  • You can use limit to get just a few rows
  • Note that the DB does not always store the data
    in the order that you input it

select from rguha_scores limit 10 select
from rguha_scores where age gt 18 and age lt 33
order by timing limit 5
46
The Python DB-API
  • A consistent way to access different databases
  • If your program uses this, your application can
    use any database supported by the API
  • Just need to change 2 lines
  • Hides much of the database internals
  • Allows you to interact in a Pythonic way
  • A row is a Python tuple

47
The Python DB-API
  • You need to construct SQL statements as strings
  • The API will allow you to
  • Send a query
  • Get the results of a query
  • One at a time
  • All together
  • Commit transactions

48
Committing Transactions
  • When you insert data, it might take more than one
    step
  • What happens if the power goes of between the
    steps?
  • You have part of the data inserted
  • The rest is not in the DB
  • You record is not valid anymore

49
Committing Transactions
  • Operations should be atomic
  • All the individual statements must complete
    successfully for the whole thing to be successful
  • If one of the steps fails, the database goes back
    to its original state
  • The collection of statements that must work all
    together is called a transaction

50
Committing Transactions
  • When we do simple statements at the Postgres
    prompt, doesnt matter too much
  • When writing Python code that will insert and
    delete stuff, need to take care

51
Connecting to Postgres
  • Well use the psycopg2 module
  • Remember to import it
  • To connect we use the connect method
  • Returns a connection object
  • We then get a cursor object
  • Now were set to use the database

con psycopg2.connect('dbnamepython211
useri211 passwordI211pwd') cursor con.cursor()
52
Performing Queries
  • The cursor object has a method call execute takes
    one argument
  • Create an SQL statement in a string
  • Send this to execute
  • This function has no return value

sql select id, age from rguha_licensed where
licensed yes cursor.execute(sql)
53
How To Handle Errors?
  • If your SQL statement has an error an exception
    will be thrown
  • Catch it and report what happened
  • If the statement was successful, there is no
    response
  • We check whether we got any results by fetching
    them

ret cursor.fetchall()
Write a Comment
User Comments (0)
About PowerShow.com