Title: CSE490i Advanced Internet Systems
1Why do we care about databases in a course about
Web?
2Administrivia
- Midterm next class (in class)
- Openbook/Open Notes/open course web (not general
web) - Everything done until Sep 28th (and covered in
homework 2) - Questions?
3Adapting old disciplines for Web-age
- Information (text) retrieval
- Scale of the web
- Hyper text/ Link structure
- Authority/hub computations
- Databases
- Multiple databases
- Heterogeneous, access limited, partially
overlapping - Network (un)reliability
- Datamining Machine Learning/Statistics/Databases
- Learning patterns from large scale data
4Why do we care about databases?
- Three reasons
- Deep web is all databases
- We can do better with structured data
- Exposing databases on web changes their
clientele..
5Deep Web is databases..
- The crawlable web pages are just the tip of a
huge ice berg that is deep web - Many web sites have huge backend databases that
generate pages dynamically in response to queries - Airline fare databases News paper classifieds
etc. - By some estimates, deep web is 2 orders of
magnitude bigger than the shallow (html page)
web - We need to exploit deep web
- Crawl/index deep web
- Select databases relevant to a query
- Provide information aggregation/integration
services over deep web databases - ..and all the big kids are trying to gobble up
anyone who is even going through the motions of
doing these.. - which leads to several DB challenges not
addressed in traditional DBs - Wrapper generation
- Schema mapping
- (automated) form filling
- Query optimization
- Learning source profiles
6Web brings unwashed masses, unreliable medium as
well as dirty data to databases..
- Web accessibility changes the user/data/medium
profile significantly - from SQL gurus supporting financial data on
dedicated DBMS to 2.1 keyword query instant
gratification seekers working with
dirty/inconsistent data over unreliable web. - Challenges
- How does one support keyword queries in
databases? - How does one support imprecise queries in
databases? - How do we handle incompleteness/inconsistency in
databases? - Does it make sense to focus on total response
time minimization - As against a multi-objective cost/benefit
optimization?
The DB community has embraced these challenges
--see Lowell Report
7Databases offer lessons on exploiting structure
- We argued that structure (and semantics) help
querying - If there is structure (as in databases) we can
exploit it - Databases is an existing technology for
exploiting some forms of structure - SQL may not look like much, but it is more
expressive than keyword queries! - If not, we can extract structure and then exploit
it - Challenges
- Techniques for extracting information (NLP-lite)
- Languages for representing/handling
Semi-structured data - Standards for supporting/exploiting semantic
tagging
8Before we play havoc with databases, lets
quickly review the traditional art of db
managementso we know all that needs to change
9This Day in History
- 1867 US purchases Alaska from Russia for
7.2 million (2 cents/acre) - 1953 Einstein announces revised unified field
theory - 1954 Test Cricket debut of Sir Garry Sobers
vs. England - 1981 President Reagan shot wounded by John
W Hinckley Jr - 2004 The first ever regular class of Rao
taught by someone other than Rao
10Concepts covered so far
- Information Retrieval
- Text retrieval
- Hyper-linked text retrieval
- Improvements
- Information Mining
- Clustering techniques to improve result
presentation - Classification and filtering techniques
11Structured data..
- Focus on text data till date.
- However, a lot of the data available on the web
is actually from (semi-)structured databases !!!! - They do their best to look like they are text
sources - What are the issues and opportunities brought up
by the presence of such sources on the web?
12Databases !!!??? you may have used
13Is the a DBMS?
Skeptics corner
- Fairly sophisticated search available
- crawler indexes pages on the web
- Keyword-based search for pages
- But, currently
- data is mostly unstructured and untyped
- search only
- cant modify the data
- cant get summaries, complex combinations of data
- Web sites typically have a DBMS in the background
to provide these functions. - They dynamically convert (wrap) the structured
data into readable English - ltIndia, New Delhigt gt The capital of India is
New Delhi. - So, if we can unwrap the text, we have
structured data! - Note also that such dynamic pages cannot be
crawled... - The (coming) Semi-structured web
- Most pages are at least semi-structured
- XML standard is expected to ease the
presentation/on-the-wire transfer of such pages.
(BUT..) - The Services
- Travel services, mapping services
- The Sensors
14Structure
An employee record
A generic web page containing text
A movie review
- How will search and querying on these three types
of data differ?
Semi-Structured
15Search vs. Query
- What if you wanted to find out which actors
donated to Al Gores presidential campaign? - Try actors donated to gore in your favorite
search engine.
16Structure helps querying
- Expressive queries
- Give me all pages that have key words Get Rich
Quick - Give me the social security numbers of all the
employees who have stayed with the company for
more than 5 years, and whose yearly salaries are
three standard deviations away from the average
salary - Give me all mails from people from ASU written
this year, which are relevant to get rich quick
- Efficient searching
- equality vs. similarity
- range-limited search
17Why use a DBMS in your website?
- Suppose we are building web-based music
distribution site. - Several questions arise
- How do we store the data? (file organization,
etc.) - How do we query the data? (write programs)
- Make sure that updates dont mess things up?
- Provide different views on the data? (registrar
versus students) - How do we deal with crashes?
- Way too complicated!
- Buy a database system!
18What Is a Database System?
- Database
a very
large, integrated collection of data. - Models a real-world enterprise
- Entities (e.g., teams, games)
- Relationships
- (e.g., The Patriots are playing in The
Superbowl) - More recently, also includes active components ,
often called business logic. (e.g., the BCS
ranking system) - A Database Management System (DBMS) is a software
system designed to store, manage, and facilitate
access to databases.
19Functionality of a DBMS
- Data Dictionary Management
- Storage management
- Data storage Definition Language (DDL)
- High level query and data manipulation language
- SQL/XQuery etc.
- May tell us what we are missing in text-based
search - Efficient query processing
- May change in the internet scenario
- Transaction processing
- Resiliency recovery from crashes,
- Different views of the data, security
- May be useful to model a collection of databases
together - Interface with programming languages
20Traditional Database Architecture
21Building an Application with a Database System
- Requirements modeling (conceptual, pictures)
- Decide what entities should be part of the
application and how they should be linked. - Schema design and implementation
- Decide on a set of tables, attributes.
- Define the tables in the database system.
- Populate database (insert tuples).
- Write application programs using the DBMS
- Now much easier, with data management API
22 Conceptual Modeling
ssn
23Data Models
- A data model is a collection of concepts for
describing data. - A schema is a description of a particular
collection of data, using a given data model. - The relational model of data is the most widely
used model today. - Main concept relation, basically a table with
rows and columns. - Every relation has a schema, which describes the
columns, or fields.
24Levels of Abstraction
- Views describe how users see the data.
-
- Conceptual schema defines logical structure
- Physical schema describes the files and indexes
used.
25Example University Database
- Conceptual schema
- Students(sid string, name string,
login string, age integer, gpareal) - Courses(cid string, cnamestring,
creditsinteger) - External Schema (View)
- Course_info(cidstring,enrollmentinteger)
- Physical schema
- Relations stored as unordered files.
- Index on first column of Students.
If five people are asked to come up with a schema
for the data, what are the odds that they will
come up with the same schema?
26Data Independence
- Applications insulated from
- how data is structured and stored.
- Logical data independence Protection from
changes in logical structure of data. - Physical data independence Protection from
changes in physical structure of data. - Q Why are these particularly important for DBMS?
27Schema Design Implementation
- Table Students
- Separates the logical view from the physical view
of the data.
28Terminology
Attribute names
Students
tuples
(Arity3)
29Querying a Database
- Find all the students taking CSE594 in Q1, 2004
- S(tructured) Q(uery) L(anguage)
- select E.name
- from Enroll E
- where E.courseCS490i and
- E.quarterWinter, 2000
- Query processor figures out how to answer the
query efficiently.
30Relational Algebra
- Operators
- tuple sets as input, new set as output
- Basic Binary Set Operators
- Result is table (set) with same attributes
- Sets must be compatible!
- R1(A1,A2,A3) ? R2(B1,B2,B3)
- ? Domain(Ai) Domain(Bi)
- Union
- All tuples in either R1 or in R2
- Intersection
- All tuples in both R1 and R2
- Difference
- All tuples in R1 but not in R2
- Complement
- All tuples not in R1
- Selection, Projection, Cartesian Product, Join
whats the universe?
31Selection s
- Grab a subset of the tuples in a relation that
satisfy a given condition - Use and, or, not, gt, lt to build condition
- Unary operation returns set with same
attributes, but selects rows
32Selection Example
Employee
SSN
Name
DepartmentID
Salary
999999999
John
1
30,000
777777777
Tony
1
32,000
888888888
Alice
2
45,000
33Projection p
- Unary operation, selects columns
- Returned schema is different,
- So returned tuples are not subset of original set
- Contrast with selection
- Eliminates duplicate tuples
34(No Transcript)
35Cartesian Product X
- Binary Operation
- Result is set of tuples combining all elements of
R1 with all elements of R2, for R1 ? R2 - Schema is union of Schema(R1) Schema(R2)
- Notice we could do selection on result to get
meaningful info!
36Cartesian Product Example
37Join
- Most common (and exciting!) operator
- Combines 2 relations
- Selecting only related tuples
- Result has all attributes of the two relations
- Equivalent to
- Cross product followed by selection followed by
Projection - Equijoin
- Join condition is equality between two attributes
- Natural join
- Equijoin on attributes of same name
- result has only one copy of join condition
attribute
38Example Natural Join
39Complex Queries
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
prodname) Company (cname, stock price,
country) Person( per-name, phone number, city)
Find phone numbers of people who bought gizmos
from Fred. Find telephony products that
somebody bought
40Exercises
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
prodname) Company (cname, stock price,
country) Person( per-name, phone number,
city) Ex 1 Find people who bought telephony
products. Ex 2 Find names of people who bought
American products Ex 3 Find names of people who
bought American products and did not
buy French products Ex 4 Find names of people
who bought American products and they
live in Seattle. Ex 5 Find people who bought
stuff from Joe or bought products
from a company whose stock prices is more than
50.
41SQL Introduction
Standard language for querying and manipulating
data Structured Query
Language
Many standards out there SQL92, SQL2, SQL3,
SQL99 Vendors support various subsets of
these (but well only discuss a subset of what
they support) Basic form syntax on relational
algebra (but many other features too) Select
attributes From relations (possibly
multiple, joined) Where conditions
(selections)
42Selections s
SELECT FROM
Company WHERE countryUSA AND
stockPrice gt 50 You can use
Attribute names of the relation(s) used in the
FROM. Comparison operators , ltgt,
lt, gt, lt, gt Apply arithmetic
operations stockprice2 Operations
on strings (e.g., for concatenation).
Lexicographic order on strings.
Pattern matching s LIKE p Special
stuff for comparing dates and times.
43Projection p
Select only a subset of the attributes
SELECT name, stock price
FROM Company WHERE
countryUSA AND stockPrice gt 50
Rename the attributes in the resulting table
SELECT name AS company,
stockprice AS price FROM
Company WHERE countryUSA AND
stockPrice gt 50
44Ordering the Results
SELECT name, stock price
FROM Company WHERE
countryUSA AND stockPrice gt 50
ORDERBY country, name
Ordering is ascending, unless you specify the
DESC keyword. Ties are broken by the second
attribute on the ORDERBY list, etc.
45Join
SELECT name, store
FROM Person, Purchase WHERE
per-namebuyer AND citySeattle
AND
productgizmo Product ( pname, price,
category, maker) Purchase (buyer, seller,
store, product) Company (cname, stock price,
country) Person( per-name, phone number, city)
46Disambiguating Attributes
Find names of people buying telephony products
SELECT Person.name FROM
Person, Purchase, Product WHERE
Person.namebuyer
AND productProduct.name
AND Product.categorytelephony Product (
name, price, category, maker) Purchase (buyer,
seller, store, product) Person( name, phone
number, city)
47Tuple Variables
Find pairs of companies making products in the
same category
SELECT product1.maker, product2.maker
FROM Product AS product1, Product AS
product2 WHERE
product1.category product2.category
AND product1.maker ltgt
product2.maker
Product ( name, price, category, maker)
48Exercises
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
product) Company (cname, stock-price,
country) Person( per-name, phone number,
city) Ex 1 Find people who bought telephony
products. Ex 2 Find names of people who bought
American products Ex 3 Find names of people who
bought American products and did not
buy French products Ex 4 Find names of people
who live in Seattle and who bought American
products. Ex 5 Find people who bought stuff
from Joe or bought products from a
company whose stock prices is more than 50.
49Views
50Defining Views
(Virtual) Views are relations, except that they
are not physically stored. They are used
mostly in order to simplify complex queries
and to define conceptually different views of the
database to different classes of users. View
purchases of telephony products CREATE VIEW
telephony-purchases AS SELECT product, buyer,
seller, store FROM Purchase, Product WHERE
Purchase.product Product.name
AND Product.category telephony
51A Different View
CREATE VIEW Seattle-view AS SELECT
buyer, seller, product, store FROM
Person, Purchase WHERE Person.city
Seattle AND
Person.name Purchase.buyer
We can later use the views SELECT
name, store FROM Seattle-view,
Product WHERE Seattle-view.product
Product.name AND
Product.category shoes
Whats really happening when we query a view??
52Updating Views
How can I insert a tuple into a table that
doesnt exist? CREATE VIEW bon-purchase AS
SELECT store, seller, product FROM
Purchase WHERE store The Bon
Marche If we make the following insertion
INSERT INTO bon-purchase VALUES
(the Bon Marche, Joe, Denby Mug) We can
simply add a tuple (the Bon Marche,
Joe, NULL, Denby Mug) to relation Purchase.
53Non-Updatable Views
Given Purchase (buyer, seller, store,
product) Person( name, phone-num, city)
CREATE VIEW Seattle-view AS SELECT
seller, product, store FROM Person,
Purchase WHERE Person.city Seattle
AND Person.name
Purchase.buyer
Why non-updatable?
How can we add the following tuple to the view?
(Joe, Shoe Model 12345, Nine West)
54Materialized Views
- Views whose corresponding queries have been
executed and the data is stored in a separate
database - Uses Caching
- Issues
- Using views in answering queries
- Normally, the views are available in addition to
database - (so, views are local caches)
- In information integration, views may be the only
things we have access to. - An internet source that specializes in woody
allen movies can be seen as a view on a database
of all movies. Except, there is no database out
there which contains all movies.. - Maintaining consistency of materialized views
55Issues w.r.t. Databases on the Web
- Information Extraction (invert the tuple to text
transformation) - Support lay user queries
- More flexible queries
- Exact (SQL) vs Approximate/Similar (Text search?)
- On semi-structured databases
- Joins over text attributes?
- Exact (SQL) vs Approximate/Similar !!!!!
- Support integration/aggregation of multiple
databases - Take a query from the user and send it to all
relevant databases - TONS of challenges
56Imprecise Queries
- Increasing number of Web accessible databases
- E.g. bibliographies, reservation systems,
department catalogs etc - Support for precise queries only exactly
matching tuples - Difficulty in extracting desired information
- Limited query capabilities provided by form based
query interface - Lack of schema/domain information
- Increasing complexity of types of data e.g.
hyptertext, images etc - Often times user wants about the same instead
of exact - Bibliography search find similar publications
Solution Provide answers closely matching query
constraints
57Query Optimization
58Query Optimization
Goal
Imperative query execution plan
Declarative SQL query
SELECT S.buyer FROM Purchase P, Person Q WHERE
P.buyerQ.name AND Q.cityseattle AND
Q.phone gt 5430000
- Inputs
- the query
- statistics about the data (indexes,
cardinalities, selectivity factors) - available memory
Ideally Want to find best plan. Practically
Avoid worst plans!
59(On-the-fly)
sname
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
- Goal of optimization To find more efficient
plans that compute the same answer.
(On-the-fly)
rating gt 5
with pipelining )
sidsid
(Use hash
Sailors
bid100
index do
not write
result to
temp)
Reserves
60Optimizing Joins
- Q(u,x) - R(u,v), S(v,w), T(w,x)
- R S T
- Many ways of doing a single join R S
- Symmetric vs. asymmetric join operations
- Nested join, hash join, double pipe-lined hash
join etc. - Processing costs alone vs. processing transfer
costs - Get R and S together vs, get R, get just the
tuples of S that will join with R (semi-join) - Many orders in which to do the join
- (R join S) join T
- (S join R) join T
- (T join S) join R etc.
- All with different costs
61Determining Join Order
- In principle, we need to consider all possible
join orderings - As the number of joins increases, the number of
alternative plans grows rapidly we need to
restrict the search space. - System-R consider only left-deep join trees.
- Left-deep trees allow us to generate all fully
pipelined plansIntermediate results not written
to temporary files. - Not all left-deep trees are fully pipelined
(e.g., SM join).
62Query Optimization Process(simplified a bit)
- Parse the SQL query into a logical tree
- identify distinct blocks (corresponding to nested
sub-queries or views). - Query rewrite phase
- apply algebraic transformations to yield a
cheaper plan. - Merge blocks and move predicates between blocks.
- Optimize each block join ordering.
- Complete the optimization select scheduling
(pipelining strategy).
63Cost Estimation
- For each plan considered, must estimate cost
- Must estimate cost of each operation in plan
tree. - Depends on input cardinalities.
- Must estimate size of result for each operation
in tree! - Use information about the input relations.
- For selections and joins, assume independence of
predicates. - System R cost estimation approach.
- Very inexact, but works ok in practice.
- More sophisticated techniques known now.
64Key Lessons in Optimization
- There are many approaches and many details to
consider in query optimization - Classic search/optimization problem!
- Not completely solved yet!
- Main points to take away are
- Algebraic rules and their use in transformations
of queries. - Deciding on join ordering System-R style
(Selinger style) optimization. - Estimating cost of plans and sizes of
intermediate results.
65Concurrency Control
- Concurrent execution of user programs key to
good DBMS performance. - Disk accesses frequent, pretty slow
- Keep the CPU working on several programs
concurrently. - Interleaving actions of different programs
trouble! - e.g., account-transfer print statement at same
time - DBMS ensures such problems dont arise.
- Users/programmers can pretend they are using a
single-user system. (called Isolation) - Thank goodness! Dont have to program very,
very carefully.
66Transactions ACID Properties
- Key concept is a transaction a sequence of
database actions (reads/writes). - DBMS ensures atomicity (all-or-nothing property)
even if system crashes in the middle of a Xact. - Each transaction, executed completely, must take
the DB between consistent states or must not run
at all. - DBMS ensures that concurrent transactions appear
to run in isolation. - DBMS ensures durability of committed Xacts even
if system crashes. -
- Note can specify simple integrity constraints on
the data. The DBMS enforces these. - Beyond this, the DBMS does not understand the
semantics of the data. - Ensuring that a single transaction (run alone)
preserves consistency is largely the users
responsibility!
67Scheduling Concurrent Transactions
- DBMS ensures that execution of T1, ... , Tn is
equivalent to some serial execution T1 ... Tn. - Before reading/writing an object, a transaction
requests a lock on the object, and waits till the
DBMS gives it the lock. All locks are held
until the end of the transaction. (Strict 2PL
locking protocol.) - Idea If an action of Ti (say, writing X) affects
Tj (which perhaps reads X), say Ti obtains the
lock on X first so Tj is forced to wait until
Ti completes.This effectively orders the
transactions. - What if Tj already has a lock on Y and Ti
later requests a lock on Y? (Deadlock!) Ti or Tj
is aborted and restarted!
68Ensuring Transaction Properites
- DBMS ensures atomicity (all-or-nothing property)
even if system crashes in the middle of a Xact. - DBMS ensures durability of committed Xacts even
if system crashes. - Idea Keep a log (history) of all actions carried
out by the DBMS while executing a set of Xacts - Before a change is made to the database, the
corresponding log entry is forced to a safe
location. (WAL protocol OS support for this is
often inadequate.) - After a crash, the effects of partially executed
transactions are undone using the log. Effects of
committed transactions are redone using the log. - trickier than it sounds!