Title: CSE490i Advanced Internet Systems
1Topic 3: Finding, Representing & Exploiting Structure
Getting Structure
- Allow structure specification languages → XML? (More structured than text and less structured than databases)
- If structure is not explicitly specified (or is obfuscated), can we extract it? → Wrapper generation/Information Extraction
Using Structure
- For retrieval → Extend IR techniques to use the additional structure
- For query processing (Joins/Aggregations etc) → Extend database techniques to use the partial structure
- For reasoning with structured knowledge → Semantic web ideas..
Structure in the context of multiple sources
- How to align structure
- How to support integrated querying on pages/sources (after alignment)
2Structure
An employee record
A generic web page containing text
A movie review
- How will search and querying on these three types
of data differ?
Semi-Structured
3Structure helps querying
- Expressive queries
- Give me all pages that have the keywords "Get Rich Quick"
- Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary
- Give me all mails from people from ASU written this year, which are relevant to "get rich quick"
4Adapting old disciplines for Web-age
- Information (text) retrieval
- Scale of the web
- Hyper text/ Link structure
- Authority/hub computations
- Databases
- Multiple databases
- Heterogeneous, access limited, partially overlapping
- Network (un)reliability
- Data mining (Machine Learning/Statistics/Databases)
- Learning patterns from large scale data
5Why do we care about databases?
- Three reasons
- Deep web is all databases
- We can do better with structured data
- Exposing databases on web changes their
clientele..
6Deep Web is databases..
- The crawlable web pages are just the tip of a huge iceberg that is the deep web
- Many web sites have huge backend databases that generate pages dynamically in response to queries
- Airline fare databases, newspaper classifieds etc.
- By some estimates, the deep web is 2 orders of magnitude bigger than the shallow (HTML page) web
- We need to exploit the deep web
- Crawl/index the deep web
- Select databases relevant to a query
- Provide information aggregation/integration services over deep web databases
- ..and all the big kids are trying to gobble up anyone who is even going through the motions of doing these..
- ..which leads to several DB challenges not addressed in traditional DBs
- Wrapper generation
- Schema mapping/Structure alignment
- (Automated) form filling
- Query optimization
- Learning source profiles
7Databases offer lessons on exploiting structure
- We argued that structure (and semantics) help querying
- If there is structure (as in databases) we can exploit it
- Databases are an existing technology for exploiting some forms of structure
- SQL may not look like much, but it is more expressive than keyword queries!
- If not, we can extract structure and then exploit it
- Challenges
- Techniques for extracting information (NLP-lite)
- Languages for representing/handling semi-structured data
- Standards for supporting/exploiting semantic tagging
8Before we play havoc with databases, let's quickly review the traditional art of DB management, so we know all that needs to change
9This Day in History
- 1867: US purchases Alaska from Russia for $7.2 million (2 cents/acre)
- 1953: Einstein announces revised unified field theory
- 1954: Test cricket debut of Sir Garry Sobers vs. England
- 1981: President Reagan shot and wounded by John W. Hinckley Jr.
- 2004: The first ever regular class of Rao taught by someone other than Rao
10Structured data..
- Focus on text data till date.
- However, a lot of the data available on the web is actually from (semi-)structured databases!
- They do their best to look like they are text sources
- What are the issues and opportunities brought up by the presence of such sources on the web?
11Databases !!!??? you may have used
12Is the Web a DBMS?
Skeptic's corner
- Fairly sophisticated search available
- Crawler indexes pages on the web
- Keyword-based search for pages
- But, currently
- data is mostly unstructured and untyped
- search only
- can't modify the data
- can't get summaries, complex combinations of data
- Web sites typically have a DBMS in the background to provide these functions.
- They dynamically convert (wrap) the structured data into readable English
- <India, New Delhi> => The capital of India is New Delhi.
- So, if we can unwrap the text, we have structured data!
- Note also that such dynamic pages cannot be crawled...
- The (coming) semi-structured web
- Most pages are at least semi-structured
- The XML standard is expected to ease the presentation/on-the-wire transfer of such pages. (BUT..)
- The Services
- Travel services, mapping services
- The Sensors
13Structure
An employee record
A generic web page containing text
A movie review
- How will search and querying on these three types
of data differ?
Semi-Structured
14Search vs. Query
- What if you wanted to find out which actors donated to Al Gore's presidential campaign?
- Try "actors donated to gore" in your favorite search engine.
15Structure helps querying
- Expressive queries
- Give me all pages that have the keywords "Get Rich Quick"
- Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary
- Give me all mails from people from ASU written this year, which are relevant to "get rich quick"
- Efficient searching
- equality vs. similarity
- range-limited search
16Why use a DBMS in your website?
- Suppose we are building a web-based music distribution site.
- Several questions arise
- How do we store the data? (file organization, etc.)
- How do we query the data? (write programs)
- How do we make sure that updates don't mess things up?
- How do we provide different views on the data? (registrar versus students)
- How do we deal with crashes?
- Way too complicated!
- Buy a database system!
17What Is a Database System?
- Database: a very large, integrated collection of data.
- Models a real-world enterprise
- Entities (e.g., teams, games)
- Relationships (e.g., The Patriots are playing in The Superbowl)
- More recently, also includes active components, often called "business logic" (e.g., the BCS ranking system)
- A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases.
18Functionality of a DBMS
- Data dictionary management
- Storage management
- Data storage definition language (DDL)
- High-level query and data manipulation language
- SQL/XQuery etc.
- May tell us what we are missing in text-based search
- Efficient query processing
- May change in the internet scenario
- Transaction processing
- Resiliency: recovery from crashes
- Different views of the data, security
- May be useful to model a collection of databases together
- Interface with programming languages
19Traditional Database Architecture
20Building an Application with a Database System
- Requirements modeling (conceptual, pictures)
- Decide what entities should be part of the application and how they should be linked.
- Schema design and implementation
- Decide on a set of tables, attributes.
- Define the tables in the database system.
- Populate database (insert tuples).
- Write application programs using the DBMS
- Now much easier, with data management API
21 Conceptual Modeling
ssn
22Data Models
- A data model is a collection of concepts for describing data.
- A schema is a description of a particular collection of data, using a given data model.
- The relational model of data is the most widely used model today.
- Main concept: relation, basically a table with rows and columns.
- Every relation has a schema, which describes the columns, or fields.
23Levels of Abstraction
- Views describe how users see the data.
- Conceptual schema defines logical structure
- Physical schema describes the files and indexes used.
24Example University Database
- Conceptual schema
- Students(sid: string, name: string, login: string, age: integer, gpa: real)
- Courses(cid: string, cname: string, credits: integer)
- External schema (view)
- Course_info(cid: string, enrollment: integer)
- Physical schema
- Relations stored as unordered files.
- Index on first column of Students.
If five people are asked to come up with a schema
for the data, what are the odds that they will
come up with the same schema?
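The three levels above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the Enrolled relation and the sample rows are hypothetical (the slide does not list them), and Course_info is assumed to count enrollments per course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema (from the slide). The PRIMARY KEY on sid also gives
-- an index on the first column of Students (physical level).
CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, login TEXT,
                       age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
-- Assumed helper relation, not on the slide.
CREATE TABLE Enrolled (sid TEXT, cid TEXT);
-- External schema: a view computed from the conceptual schema.
CREATE VIEW Course_info AS
  SELECT cid, COUNT(sid) AS enrollment FROM Enrolled GROUP BY cid;
""")
conn.executemany("INSERT INTO Enrolled VALUES (?, ?)",
                 [("s1", "CSE490i"), ("s2", "CSE490i"), ("s1", "CSE444")])

# Users of the view never see how Enrolled is stored or indexed.
course_info = dict(conn.execute("SELECT cid, enrollment FROM Course_info"))
print(course_info)
```

An application written against Course_info keeps working if the physical layout of the base tables changes, which is the point of the levels of abstraction.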
25Data Independence
- Applications insulated from how data is structured and stored.
- Logical data independence: protection from changes in logical structure of data.
- Physical data independence: protection from changes in physical structure of data.
- Q: Why are these particularly important for DBMS?
26Schema Design Implementation
- Table Students
- Separates the logical view from the physical view
of the data.
27Terminology
Attribute names
Students
tuples
(Arity3)
28Querying a Database
- Find all the students taking CSE594 in Q1, 2004
- S(tructured) Q(uery) L(anguage)
- select E.name
- from Enroll E
- where E.course = 'CS490i' and
- E.quarter = 'Winter, 2000'
- Query processor figures out how to answer the query efficiently.
29Relational Algebra
- Operators
- tuple sets as input, new set as output
- Basic Binary Set Operators
- Result is table (set) with same attributes
- Sets must be compatible!
- R1(A1,A2,A3) and R2(B1,B2,B3) are compatible if Domain(Ai) = Domain(Bi) for every i
- Union
- All tuples in either R1 or in R2
- Intersection
- All tuples in both R1 and R2
- Difference
- All tuples in R1 but not in R2
- Complement
- All tuples not in R1 (what's the universe?)
- Selection, Projection, Cartesian Product, Join
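The binary set operators map directly onto Python sets of tuples. The sketch below uses two small hypothetical relations with compatible schemas. The complement only makes sense once a universe is fixed, so an explicit (assumed) finite universe is built first.

```python
# Two compatible relations: same arity, same attribute domains.
R1 = {(1, "a"), (2, "b"), (3, "c")}
R2 = {(2, "b"), (3, "c"), (4, "d")}

union = R1 | R2          # all tuples in either R1 or R2
intersection = R1 & R2   # all tuples in both R1 and R2
difference = R1 - R2     # all tuples in R1 but not in R2

# Complement needs a universe: here, every pair over the assumed domains.
universe = {(i, s) for i in range(1, 5) for s in "abcd"}
complement = universe - R1

print(sorted(union), sorted(intersection), sorted(difference), len(complement))
```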
30Selection σ
- Grab a subset of the tuples in a relation that satisfy a given condition
- Use and, or, not, >, < to build the condition
- Unary operation: returns a set with the same attributes, but selects rows
31Selection Example
Employee
SSN        Name   DepartmentID  Salary
999999999  John   1             30,000
777777777  Tony   1             32,000
888888888  Alice  2             45,000
32Projection π
- Unary operation, selects columns
- Returned schema is different,
- So returned tuples are not subset of original set
- Contrast with selection
- Eliminates duplicate tuples
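Selection and projection can be sketched over the Employee relation from the selection example. The helper names `select` and `project` and the salary threshold of 40,000 are illustrative assumptions, not part of the slides.

```python
employee = [
    {"SSN": "999999999", "Name": "John", "DepartmentID": 1, "Salary": 30000},
    {"SSN": "777777777", "Name": "Tony", "DepartmentID": 1, "Salary": 32000},
    {"SSN": "888888888", "Name": "Alice", "DepartmentID": 2, "Salary": 45000},
]

def select(relation, predicate):
    """Selection: same attributes, only the rows satisfying the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    """Projection: keep the given columns and eliminate duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        key = tuple(t[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

low_paid = select(employee, lambda t: t["Salary"] < 40000)
departments = project(employee, ["DepartmentID"])
print([t["Name"] for t in low_paid], departments)
```

Note that the projection returns two tuples for three input rows: John and Tony share a department, so the duplicate is dropped.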
34Cartesian Product X
- Binary Operation
- Result is the set of tuples combining all elements of R1 with all elements of R2, for R1 X R2
- Schema is the union of Schema(R1) and Schema(R2)
- Notice we could do selection on the result to get meaningful info!
35Cartesian Product Example
36Join
- Most common (and exciting!) operator
- Combines 2 relations
- Selecting only related tuples
- Result has all attributes of the two relations
- Equivalent to
- Cross product followed by selection followed by projection
- Equijoin
- Join condition is equality between two attributes
- Natural join
- Equijoin on attributes of same name
- Result has only one copy of the join condition attribute
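The equivalence "join = cross product, then selection, then projection" can be traced step by step. The two relations below are hypothetical sample data chosen for this sketch.

```python
from itertools import product

employee = [("John", 1), ("Tony", 1), ("Alice", 2)]   # (name, dept_id)
department = [(1, "Sales"), (2, "Research")]          # (dept_id, dept_name)

# Step 1: cross product, every employee paired with every department.
cross = [e + d for e, d in product(employee, department)]

# Step 2: selection, keep pairs whose dept_id columns agree (equijoin).
selected = [t for t in cross if t[1] == t[2]]

# Step 3: projection, keep only one copy of the join attribute (natural join).
natural_join = [(name, dept_id, dept_name)
                for name, dept_id, _, dept_name in selected]
print(natural_join)
```

A real DBMS would not materialize the cross product; the three steps only define what the join must return.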
37Example Natural Join
38Complex Queries
Product(pname, price, category, maker)
Purchase(buyer, seller, store, prodname)
Company(cname, stock price, country)
Person(per-name, phone number, city)
Find phone numbers of people who bought gizmos from Fred.
Find telephony products that somebody bought.
39Exercises
Product(pname, price, category, maker)
Purchase(buyer, seller, store, prodname)
Company(cname, stock price, country)
Person(per-name, phone number, city)
Ex 1: Find people who bought telephony products.
Ex 2: Find names of people who bought American products.
Ex 3: Find names of people who bought American products and did not buy French products.
Ex 4: Find names of people who bought American products and live in Seattle.
Ex 5: Find people who bought stuff from Joe or bought products from a company whose stock price is more than 50.
40SQL Introduction
Standard language for querying and manipulating data: Structured Query Language
Many standards out there: SQL92, SQL2, SQL3, SQL99
Vendors support various subsets of these (but we'll only discuss a subset of what they support)
Basic form: syntax on relational algebra (but many other features too)
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections)
41Selections σ
SELECT * FROM Company WHERE country = 'USA' AND stockPrice > 50
You can use:
Attribute names of the relation(s) used in the FROM.
Comparison operators: =, <>, <, >, <=, >=
Apply arithmetic operations: stockPrice*2
Operations on strings (e.g., "||" for concatenation).
Lexicographic order on strings.
Pattern matching: s LIKE p
Special stuff for comparing dates and times.
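The selection above can be run against an in-memory SQLite database. The company rows are made-up sample data, and SQLite stands in for whichever DBMS the course used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Company (cname TEXT, stockPrice REAL, country TEXT)")
conn.executemany("INSERT INTO Company VALUES (?, ?, ?)", [
    ("GizmoWorks", 75.0, "USA"),   # hypothetical rows
    ("Canon", 65.0, "Japan"),
    ("Hitachi", 15.0, "Japan"),
    ("Acme", 40.0, "USA"),
])

# The selection from the slide.
rows = conn.execute(
    "SELECT * FROM Company WHERE country = 'USA' AND stockPrice > 50"
).fetchall()

# Arithmetic, string concatenation (||) and pattern matching (LIKE).
doubled = conn.execute(
    "SELECT cname || ' Inc.', stockPrice * 2 FROM Company WHERE cname LIKE 'G%'"
).fetchall()
print(rows, doubled)
```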
42Projection π
Select only a subset of the attributes
SELECT name, stockPrice
FROM Company
WHERE country = 'USA' AND stockPrice > 50
Rename the attributes in the resulting table
SELECT name AS company, stockPrice AS price
FROM Company
WHERE country = 'USA' AND stockPrice > 50
43Ordering the Results
SELECT name, stockPrice
FROM Company
WHERE country = 'USA' AND stockPrice > 50
ORDER BY country, name
Ordering is ascending, unless you specify the DESC keyword.
Ties are broken by the second attribute on the ORDER BY list, etc.
44Join
SELECT per-name, store
FROM Person, Purchase
WHERE per-name = buyer AND city = 'Seattle' AND product = 'gizmo'
Product(pname, price, category, maker)
Purchase(buyer, seller, store, product)
Company(cname, stock price, country)
Person(per-name, phone number, city)
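A runnable version of the join, as a sketch only: the attribute is spelled `per_name` because a hyphen is not legal in an unquoted SQL identifier, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person   (per_name TEXT, phone TEXT, city TEXT);
CREATE TABLE Purchase (buyer TEXT, seller TEXT, store TEXT, product TEXT);
INSERT INTO Person VALUES
  ('Ann', '555-0101', 'Seattle'),
  ('Bob', '555-0102', 'Portland');
INSERT INTO Purchase VALUES
  ('Ann', 'Fred', 'The Bon Marche', 'gizmo'),
  ('Ann', 'Joe',  'Nine West',      'shoes'),
  ('Bob', 'Fred', 'The Bon Marche', 'gizmo');
""")

# Join Person with Purchase on per_name = buyer, then filter.
rows = conn.execute("""
    SELECT per_name, store
    FROM Person, Purchase
    WHERE per_name = buyer AND city = 'Seattle' AND product = 'gizmo'
""").fetchall()
print(rows)
```

Bob also bought a gizmo but lives in Portland, and Ann's shoe purchase fails the product test, so only one row survives.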
45Disambiguating Attributes
Find names of people buying telephony products
SELECT Person.name
FROM Person, Purchase, Product
WHERE Person.name = buyer AND product = Product.name AND Product.category = 'telephony'
Product(name, price, category, maker)
Purchase(buyer, seller, store, product)
Person(name, phone number, city)
46Tuple Variables
Find pairs of companies making products in the same category
SELECT product1.maker, product2.maker
FROM Product AS product1, Product AS product2
WHERE product1.category = product2.category AND product1.maker <> product2.maker
Product(name, price, category, maker)
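Tuple variables let one relation play two roles in the same query. The sketch below runs the self-join on a few hypothetical products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (name TEXT, price REAL, category TEXT, maker TEXT)")
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", [
    ("phone", 99.0, "telephony", "Acme"),     # hypothetical rows
    ("fax", 150.0, "telephony", "Globex"),
    ("mug", 5.0, "kitchen", "Acme"),
])

# product1 and product2 range independently over the same table.
pairs = conn.execute("""
    SELECT product1.maker, product2.maker
    FROM Product AS product1, Product AS product2
    WHERE product1.category = product2.category
      AND product1.maker <> product2.maker
""").fetchall()
print(pairs)
```

Each pair shows up in both orders, (Acme, Globex) and (Globex, Acme), because the two tuple variables are symmetric.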
47Exercises
Product(pname, price, category, maker)
Purchase(buyer, seller, store, product)
Company(cname, stock-price, country)
Person(per-name, phone number, city)
Ex 1: Find people who bought telephony products.
Ex 2: Find names of people who bought American products.
Ex 3: Find names of people who bought American products and did not buy French products.
Ex 4: Find names of people who live in Seattle and who bought American products.
Ex 5: Find people who bought stuff from Joe or bought products from a company whose stock price is more than 50.
48Views
49Defining Views
(Virtual) views are "macro" relations defined in terms of base relations (they may or may not be physically stored).
They are used mostly in order to simplify complex queries and to define conceptually different views of the database to different classes of users.
View: purchases of telephony products
CREATE VIEW telephony-purchases AS
SELECT product, buyer, seller, store
FROM Purchase, Product
WHERE Purchase.product = Product.name AND Product.category = 'telephony'
50A Different View
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer
We can later use the views:
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name AND Product.category = 'shoes'
What's really happening when we query a view??
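One way to answer the closing question: the view name is expanded into its defining query, so querying the view is the same as querying the base relations. The sketch checks this in SQLite, with the view spelled `Seattle_view` (hyphens are not legal in unquoted identifiers) and hypothetical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person   (name TEXT, phone TEXT, city TEXT);
CREATE TABLE Purchase (buyer TEXT, seller TEXT, store TEXT, product TEXT);
CREATE TABLE Product  (name TEXT, price REAL, category TEXT, maker TEXT);
INSERT INTO Person VALUES ('Ann', '555-0101', 'Seattle'), ('Bob', '555-0102', 'Portland');
INSERT INTO Purchase VALUES
  ('Ann', 'Joe',  'Nine West',      'Shoe Model 12345'),
  ('Bob', 'Joe',  'Nine West',      'Shoe Model 12345'),
  ('Ann', 'Fred', 'The Bon Marche', 'gizmo');
INSERT INTO Product VALUES
  ('Shoe Model 12345', 80.0, 'shoes',   'Nine West'),
  ('gizmo',            20.0, 'gadgets', 'GizmoWorks');
CREATE VIEW Seattle_view AS
  SELECT buyer, seller, product, store
  FROM Person, Purchase
  WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer;
""")

# Query written against the view.
via_view = conn.execute("""
    SELECT name, store FROM Seattle_view, Product
    WHERE Seattle_view.product = Product.name AND Product.category = 'shoes'
""").fetchall()

# The same query with the view definition expanded by hand.
expanded = conn.execute("""
    SELECT Product.name, Purchase.store
    FROM Person, Purchase, Product
    WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer
      AND Purchase.product = Product.name AND Product.category = 'shoes'
""").fetchall()
print(via_view, expanded)
```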
51Updating Views
How can I insert a tuple into a table that doesn't exist?
CREATE VIEW bon-purchase AS
SELECT store, seller, product
FROM Purchase
WHERE store = 'The Bon Marche'
If we make the following insertion:
INSERT INTO bon-purchase VALUES ('the Bon Marche', 'Joe', 'Denby Mug')
We can simply add a tuple ('the Bon Marche', 'Joe', NULL, 'Denby Mug') to relation Purchase.
52Non-Updatable Views
Given Purchase(buyer, seller, store, product) and Person(name, phone-num, city)
CREATE VIEW Seattle-view AS
SELECT seller, product, store
FROM Person, Purchase
WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer
Why non-updatable?
How can we add the following tuple to the view? ('Joe', 'Shoe Model 12345', 'Nine West')
53Materialized Views
- Views whose corresponding queries have been executed and the data is stored in a separate database
- Uses: caching
- Issues
- Using views in answering queries
- Normally, the views are available in addition to the database (so, views are local caches)
- In information integration, views may be the only things we have access to.
- An internet source that specializes in Woody Allen movies can be seen as a view on a database of all movies. Except, there is no database out there which contains all movies..
- Maintaining consistency of materialized views
54Issues w.r.t. Databases on the Web
- Information extraction (invert the tuple-to-text transformation)
- Support lay user queries
- More flexible queries
- Exact (SQL) vs. approximate/similar (text search?)
- On semi-structured databases
- Joins over text attributes?
- Exact (SQL) vs. approximate/similar!
- Support integration/aggregation of multiple databases
- Take a query from the user and send it to all relevant databases
- TONS of challenges
55Imprecise Queries
- Increasing number of Web-accessible databases
- E.g. bibliographies, reservation systems, department catalogs etc.
- Support for precise queries only: exactly matching tuples
- Difficulty in extracting desired information
- Limited query capabilities provided by form-based query interfaces
- Lack of schema/domain information
- Increasing complexity of types of data, e.g. hypertext, images etc.
- Often the user wants "about the same" instead of "exact"
- Bibliography search: find similar publications
Solution: Provide answers closely matching query constraints
56Query Optimization
57Query Optimization
Goal: Declarative SQL query → Imperative query execution plan
SELECT P.buyer FROM Purchase P, Person Q WHERE P.buyer = Q.name AND Q.city = 'seattle' AND Q.phone > 5430000
- Inputs
- the query
- statistics about the data (indexes, cardinalities, selectivity factors)
- available memory
Ideally: Want to find best plan. Practically: Avoid worst plans!
58SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5
- Goal of optimization: To find more efficient plans that compute the same answer.
[Figure: query plan tree. Root: projection on sname (on-the-fly). Below it: selection rating > 5 (on-the-fly). Below it: join sid = sid (... with pipelining) whose inputs are Sailors and the selection bid = 100 on Reserves (use hash index; do not write result to temp).]
59Optimizing Joins
- Q(u,x) :- R(u,v), S(v,w), T(w,x)
- R ⋈ S ⋈ T
- Many ways of doing a single join R ⋈ S
- Symmetric vs. asymmetric join operations
- Nested join, hash join, double pipelined hash join etc.
- Processing costs alone vs. processing + transfer costs
- Get R and S together vs. get R, get just the tuples of S that will join with R (semi-join)
- Many orders in which to do the join
- (R join S) join T
- (S join R) join T
- (T join S) join R etc.
- All with different costs
60Determining Join Order
- In principle, we need to consider all possible join orderings
- As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
- System-R: consider only left-deep join trees.
- Left-deep trees allow us to generate all fully pipelined plans: intermediate results are not written to temporary files.
- Not all left-deep trees are fully pipelined (e.g., SM join).
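A toy illustration of why join order matters: enumerate every left-deep order for three relations and score each one. Everything here is an assumption made for the sketch: the relation names, the cardinalities, the single uniform join selectivity, and the cost model (sum of intermediate result sizes). It is not the System-R cost model.

```python
from itertools import permutations

# Hypothetical base-relation cardinalities.
cardinality = {"A": 1000, "B": 100, "C": 10}
JOIN_SELECTIVITY = 0.01  # assumed identical for every pair of relations

def left_deep_cost(order):
    """Cost of ((order[0] join order[1]) join order[2]) ... as the
    sum of the estimated intermediate result sizes."""
    size = cardinality[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size = size * cardinality[rel] * JOIN_SELECTIVITY
        cost += size
    return cost

# n relations give n! left-deep orders; here 3! = 6.
plans = {order: left_deep_cost(order) for order in permutations(cardinality)}
best = min(plans, key=plans.get)

for order, cost in sorted(plans.items(), key=lambda kv: kv[1]):
    print(" join ".join(order), round(cost, 1))
```

Under these assumptions the cheapest plans join the two smallest relations first. The factorial growth of `plans` is the reason real optimizers prune the search, for example with dynamic programming over subsets.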
61Query Optimization Process(simplified a bit)
- Parse the SQL query into a logical tree
- Identify distinct blocks (corresponding to nested sub-queries or views).
- Query rewrite phase
- Apply algebraic transformations to yield a cheaper plan.
- Merge blocks and move predicates between blocks.
- Optimize each block: join ordering.
- Complete the optimization: select scheduling (pipelining strategy).
62Cost Estimation
- For each plan considered, must estimate cost
- Must estimate cost of each operation in plan tree.
- Depends on input cardinalities.
- Must estimate size of result for each operation in tree!
- Use information about the input relations.
- For selections and joins, assume independence of predicates.
- System R cost estimation approach.
- Very inexact, but works OK in practice.
- More sophisticated techniques known now.
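Under the independence assumption, the estimated result size is the input cardinality multiplied by the selectivity of each predicate. The function name and the numbers below are hypothetical, chosen only to show the arithmetic.

```python
def estimate_result_size(input_cardinality, selectivities):
    """Estimate output size assuming the predicates are independent,
    so their selectivity factors simply multiply."""
    size = float(input_cardinality)
    for s in selectivities:
        size *= s
    return size

# Assumed statistics: 40,000 input tuples, one predicate keeps half
# of them, another keeps 1 in 100.
estimate = estimate_result_size(40000, [0.5, 0.01])
print(estimate)
```

If the two predicates are in fact correlated, the true size can be far from this estimate, which is one reason the slides call the approach "very inexact".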
63Key Lessons in Optimization
- There are many approaches and many details to consider in query optimization
- Classic search/optimization problem!
- Not completely solved yet!
- Main points to take away are
- Algebraic rules and their use in transformations of queries.
- Deciding on join ordering: System-R style (Selinger style) optimization.
- Estimating cost of plans and sizes of intermediate results.
64Concurrency Control
- Concurrent execution of user programs is key to good DBMS performance.
- Disk accesses are frequent and pretty slow
- Keep the CPU working on several programs concurrently.
- Interleaving actions of different programs: trouble!
- e.g., account transfer and print statement at the same time
- DBMS ensures such problems don't arise.
- Users/programmers can pretend they are using a single-user system (called "Isolation").
- Thank goodness! Don't have to program "very, very carefully".
65Transactions ACID Properties
- Key concept is a transaction: a sequence of database actions (reads/writes).
- DBMS ensures atomicity (all-or-nothing property) even if the system crashes in the middle of a Xact.
- Each transaction, executed completely, must take the DB between consistent states or must not run at all.
- DBMS ensures that concurrent transactions appear to run in isolation.
- DBMS ensures durability of committed Xacts even if the system crashes.
- Note: can specify simple integrity constraints on the data. The DBMS enforces these.
- Beyond this, the DBMS does not understand the semantics of the data.
- Ensuring that a single transaction (run alone) preserves consistency is largely the user's responsibility!
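Atomicity and a simple integrity constraint can be observed with sqlite3. In this sketch the Account table, the names, and the amounts are hypothetical. The CHECK constraint is the "simple integrity constraint" the DBMS enforces, while getting the transfer logic right remains the programmer's job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO Account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money as one transaction: both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE Account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE Account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK
balances = dict(conn.execute("SELECT name, balance FROM Account"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob had already been applied when the debit was rejected; the rollback undoes it, so no money is created.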
66Scheduling Concurrent Transactions
- DBMS ensures that the execution of T1, ..., Tn is equivalent to some serial execution T1' ... Tn' (the same transactions, run one after another in some order).
- Before reading/writing an object, a transaction requests a lock on the object, and waits till the DBMS gives it the lock. All locks are held until the end of the transaction. (Strict 2PL locking protocol.)
- Idea: If an action of Ti (say, writing X) affects Tj (which perhaps reads X), say Ti obtains the lock on X first, so Tj is forced to wait until Ti completes. This effectively orders the transactions.
- What if Tj already has a lock on Y and Ti later requests a lock on Y? (Deadlock!) Ti or Tj is aborted and restarted!
67Ensuring Transaction Properties
- DBMS ensures atomicity (all-or-nothing property) even if the system crashes in the middle of a Xact.
- DBMS ensures durability of committed Xacts even if the system crashes.
- Idea: Keep a log (history) of all actions carried out by the DBMS while executing a set of Xacts
- Before a change is made to the database, the corresponding log entry is forced to a safe location. (WAL protocol; OS support for this is often inadequate.)
- After a crash, the effects of partially executed transactions are undone using the log. Effects of committed transactions are redone using the log.
- Trickier than it sounds!
68Web brings unwashed masses, unreliable medium as
well as dirty data to databases..
- Web accessibility changes the user/data/medium profile significantly
- From SQL gurus supporting financial data on dedicated DBMS to 2.1-keyword-query instant gratification seekers working with dirty/inconsistent data over the unreliable web.
- Challenges
- How does one support keyword queries in databases?
- How does one support imprecise queries in databases?
- How do we handle incompleteness/inconsistency in databases?
- Does it make sense to focus on total response time minimization
- As against a multi-objective cost/benefit optimization?
The DB community has embraced these challenges -- see Lowell Report
69Specifying Structure The XML Standard
70Specifying Structured Text/Data XML
- XML is the confluence of several factors
- The Web needed a more declarative format for data, trying to describe the meaning of the data
- Documents needed a mechanism for extended tags to mark structure
- Database people needed a more flexible interchange format
- Original expectation
- The whole web would go to XML instead of HTML
- Today's reality
- Not so. But XML is used all over, under the covers
Differing expectations based on which side you came from
71An XML Document Example
<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review>
      <suntimes>
        <reviewer>Roger Ebert</reviewer> gives <rating>two thumbs up</rating>! A fun action movie, Harrison Ford at his best.
      </suntimes>
    </review>
    <review>
      <nyt>The standard hollywood summer movie strikes back.</nyt>
    </review>
    <box_office>183,752,965</box_office>
  </show>
  <show year="1994">
    <title>X Files,The</title>
    <seasons>4</seasons>
  </show>
</imdb>
Callouts: Mixed Content (the suntimes element), Attribute (year)
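The example document can be loaded with Python's standard xml.etree.ElementTree module. The sketch below is one way to read the attribute and the mixed content; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review>
      <suntimes><reviewer>Roger Ebert</reviewer> gives <rating>two thumbs up</rating>! A fun action movie, Harrison Ford at his best.</suntimes>
    </review>
    <review>
      <nyt>The standard hollywood summer movie strikes back.</nyt>
    </review>
    <box_office>183,752,965</box_office>
  </show>
  <show year="1994">
    <title>X Files,The</title>
    <seasons>4</seasons>
  </show>
</imdb>"""

root = ET.fromstring(doc)

# Attribute access: year is an attribute of <show>, title is a child element.
titles_by_year = {show.get("year"): show.findtext("title")
                  for show in root.findall("show")}

# Mixed content: text interleaved with child elements inside <suntimes>.
suntimes = root.find("./show/review/suntimes")
mixed_text = "".join(suntimes.itertext())

print(titles_by_year)
print(mixed_text)
```

The two shows have different children (box_office versus seasons), which is the semi-structured aspect: no single fixed schema is required.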
73HTML vs. XML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
  <book> <title> Foundations </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  ...
</bibliography>
Self-describing
- Schema info part of the data
- Good for data exchange (albeit baroque for storage)
74<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
  <book> <title> Foundations </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
HTML describes presentation
XML describes content
XSL (stylesheets) can be used to specify the conversion
75XML Terminology
- tags: book, title, author, ...
- start tag: <book>, end tag: </book>
- elements: <book>...</book>, <author>...</author>
- elements are nested
- empty element: <red></red>, abbreviated <red/>
- an XML document has a single root element
- a well-formed XML document has matching tags
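Well-formedness can be checked mechanically: a parser either accepts the text or reports an error. The sketch uses the standard library parser, and the helper name `is_well_formed` is an assumption of this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a single, properly nested XML tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

expected = {
    "<book><title>Foundations</title></book>": True,
    "<book><title>Foundations</book></title>": False,  # tags not properly nested
    "<book></book><book></book>": False,               # more than one root element
    "<red/>": True,                                    # abbreviated empty element
}
results = {text: is_well_formed(text) for text in expected}
print(results)
```

Well-formedness is purely syntactic; it says nothing about whether the tags follow any particular schema.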
76DOM Tree (Document-Object Model)
- An XML document can be seen as a hierarchical tree
77XML Order
- If you see an XML file as a text file with tags, then order should matter
- If you see an XML file as a self-describing version of (relational) data, then order shouldn't matter
- Which should be the default?
78More XML Attributes
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  ...
  <year> 1995 </year>
</book>
Attributes are single-valued
-- No guidance on when to use them
79More XML Oids and References
Object identifiers
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name>
  <children idref="o123 o555"/>
</person>
<person id="o123" mother="o456"><name>John</name>
</person>
oids and references in XML are just syntax
81XML Meaning
82XML ≠ machine accessible meaning
Jim Hendler
This is what a web page in natural language looks like for a machine (unless it is in Beijing.. :-)
83XML ≠ machine accessible meaning
Jim Hendler
XML allows meaningful tags to be added to parts of the text
84XML ≠ machine accessible meaning
Jim Hendler
But to your machine, the tags look like this (assuming it is not in Athens)
85XML ≠ machine accessible meaning
Jim Hendler
Schemas help, by relating common terms between documents.
(Figure labels: <CV>, private)
86But other people use other schemas
Jim Hendler
Someone else has one like this.
87But other people use other schemas
Jim Hendler
..which don't fit in.
(Figure labels: <CV>, private)
Moral: There is still need for ontology mapping.. → either by fiat → or by learning
88XML Meaning Summary
- XML is a purely syntactic standard
- Saying that something is in XML format is like saying something is in List or Table format
- It is NOT like saying that something is in English/C etc. (all of which have specific semantics)
- Tags in XML do not up front have any meaning
- Tags can be overloaded with specific meaning through prior agreement or standardization
- Such agreements/standardization are possible for specific sub-tasks (e.g. HTML for rendering) or specific sub-communities (e.g. ebXML etc.; see next slide)
- A tag's meaning can be expressed by relating it to other tags
- This is the usual knowledge representation way (meaning comes from inter-predicate relations). Semantic Web pushes this view.
- You can also learn the relations through context/practice/usage etc. This is the sort of view taken by (semi-automated) schema-mapping techniques
89XML Dialect pot pourri
- Extensible Financial Reporting Markup Language (XFRML),
- eXtensible Business Reporting Language (XBRL),
- MusicXML,
- Spacecraft Markup Language (SML),
- Bank Internet Payment System (BIPS),
- Bioinformatic Sequence Markup Language (BSML),
- Biopolymer Markup Language (BIOML),
- Open Catalog Format (OCF),
- Chemical Markup Language (CML),
- Electronic Business XML Initiative (ebXML),
- Open Trading Protocol (OTP),
- FinXML, Financial Information eXchange protocol (FIX),
- RecipeML, CVML,
- XML Bookmark Exchange Language (XBEL),
- Scalable Vector Graphics (SVG),
- NewsML,
- DocBook,
- Real Estate Listing Markup Language (RELML), . . .
Examples of communities that standardized their tags
90Who puts everything into XML?
- To a certain extent, this is a vacuous question, once we realize that XML is just a syntactic standard
- You can put things into XML by just putting a <body> tag (or any tag) at the beginning and end of the file
- XML is not meant to be an imposition but rather a facilitator
- XML facilitates marking up structure if someone wants to do this. That someone can be
- the creator of the page
- a secondary user who wants to tag the page
- an extraction program that wants to remember the structure it extracted by tagging the page
- The markup tags may or may not have any specific meaning based on prior agreements/standardization
91Why are IR folks excited about XML?
- XML files are text files with structure
- Structure easily identifiable (the DOM structure)
- We can improve Precision/Recall by taking structure into account..
- We already did a bit, e.g. higher weight to words occurring in the header tags..
92Why are Database folks excited about XML?
- XML is just a syntax for (self-describing) data
- This is still exciting because
- No standard syntax for relational data
- With XML, we can
- Translate any legacy data to XML
- Can exchange data in XML format
- Ship over the web, input to any application
- Talk about querying on semi-structured data
93XML viewed from a Database Point of View
94XML vs. Relational Data
- XML is meant as a language that supports both text and structured data
- Conflicting demands...
- XML supports semi-structured data
- In essence, the schema can be a union of multiple schemas
- Easy to represent books with or without prices, books with any number of authors etc.
- XML supports free mixing of text and data
- using the #PCDATA type
- XML is ordered (while relational data is unordered)
95XML Data Model (DOM)
[Figure: DOM tree for the imdb example document]
imdb
  show
    @year: 1993
    title: Fugitive, The
    review
      suntimes
        reviewer: Roger Ebert
        (text) gives
        rating: two...
    review
      nyt
- Check http://www.w3.org/XML/ for more details
96DTDs
Notice that a DTD is not in XML syntax :-(
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
Semi-structured
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> ... </section>
    <section> ... </section>
  </section>
</paper>
97XML Schema
- Supersedes DTD (and has XML syntax)
- unifies previous schema proposals
- generalizes DTDs
- uses XML syntax
- two documents: structure and datatypes
- http://www.w3.org/TR/xmlschema-1
- http://www.w3.org/TR/xmlschema-2
98XML Schema
99RDF Meta-data Standard for Web
<rdf:Description about="www.mypage.com">
  <about> birds, butterflies, snakes </about>
  <author>
    <rdf:Description>
      <firstname> John </firstname>
      <lastname> Smith </lastname>
    </rdf:Description>
  </author>
</rdf:Description>
Good ol' semantic networks..?
100Xquery Resources
- XQuery 1.0: An XML Query Language
- W3C Working Draft, 20 December 2001
- XML Query Use Cases
- W3C Working Draft, 20 December 2001
- Microsoft .NET XQuery Language Demo
- http://131.107.228.20/
- http://support.x-hive.com/xquery/index.html
- Supports querying on the documents described in the W3C Use Cases
- XQuery Tutorial by Fankhauser & Wadler
- www.research.avayalabs.com/user/wadler/papers/xquery-tutorial/xquery-tutorial.pdf
101http://support.x-hive.com/xquery/index.html
You will be asked to play with it in homework 3
102FLoWeR Expressions
- XQuery queries are made up of FLWR expressions that work on paths
- For: binds variables to nodes
- Let: computes aggregates
- Where: applies a formula to find matching elements
- Return: constructs the output elements
- Path expressions are of the form
- element//element/element[attrib=value]
103Comparison to SQL
- Look at the use case descriptions in the XQuery manual
- Supports all (?) SQL-style queries (with different syntax, of course): see the default queries in the demo
- Has support for
- construction: outputting the answers in arbitrary XML formats (use case "XMP")
- path expressions: navigating the XML tree (use case "SEQ")
- simple text queries (use case "TEXT")
- Allows queries on tag elements
- Removes the data/meta-data barrier in queries
- For each book that has at least one author, list the title and first two authors, and an empty "et-al" element if the book has additional authors. (XMP use case 6)
10411/20
Make-up Class: Wed 26th, 10:30 AM, Room TBD (probably 210)
- XQuery, IR-style search on XML, Semantic Web standards
105DTD for http://www.bn.com/bib.xml
<!ELEMENT bib (book*)>
<!ELEMENT book (title, (author+ | editor+), publisher, price)>
<!ATTLIST book year CDATA #REQUIRED>
<!ELEMENT author (last, first)>
<!ELEMENT editor (last, first, affiliation)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT affiliation (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT price (#PCDATA)>
106Example Query
Query:
<bib> {
  for $b in /bib/book
  where $b/publisher = "Addison-Wesley"
    and $b/@year > 1991
  return <book year={ $b/@year }>
           { $b/title }
         </book>
} </bib>
- For all Addison-Wesley books after 1991,
- return the title, with the year kept as an attribute of book
Result:
<bib>
  <book year="1994"> <title>TCP/IP Illustrated</title> </book>
  <book year="1992"> <title>Advanced Programming in the Unix environment</title> </book>
</bib>
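To see how the for / where / return clauses line up with ordinary code, here is a rough Python equivalent using ElementTree. It is a sketch, not an XQuery engine. The sample document is a trimmed, assumed fragment of bib.xml, and the third book is included only so that the filter has something to reject.

```python
import xml.etree.ElementTree as ET

# Assumed sample data shaped like the bib.xml DTD (author and price omitted for brevity).
BIB = """
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title><publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Advanced Programming in the Unix environment</title><publisher>Addison-Wesley</publisher></book>
  <book year="2000"><title>Data on the Web</title><publisher>Morgan Kaufmann Publishers</publisher></book>
</bib>
"""

def addison_wesley_after_1991(xml_text):
    result = ET.Element("bib")
    for b in ET.fromstring(xml_text).findall("book"):            # for $b in /bib/book
        if (b.findtext("publisher") == "Addison-Wesley"          # where ...
                and int(b.get("year")) > 1991):
            out = ET.SubElement(result, "book", year=b.get("year"))  # return <book year=...>
            ET.SubElement(out, "title").text = b.findtext("title")
    return result

answer = addison_wesley_after_1991(BIB)
print(ET.tostring(answer, encoding="unicode"))
```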
107Example Query (2)
- Return the books that cost more at amazon than at fatbrain
let $amazon := document("http://www.amazon.com/books.xml"),
    $fatbrain := document("http://www.fatbrain.com/books.xml")
for $am in $amazon/books/book,
    $fat in $fatbrain/books/book
where $am/isbn = $fat/isbn
  and $am/price > $fat/price
return <book> { $am/title, $am/price, $fat/price } </book>
Join: the isbn equality in the where clause joins the two sources
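The same join can be sketched in Python as a hash join on isbn. The two catalogs below are invented stand-ins for the amazon and fatbrain documents, which are assumed to have the shape /books/book with isbn, title and price children.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogs; real sources would be fetched and parsed instead.
AMAZON = ("<books><book><isbn>1</isbn><title>A</title><price>50</price></book>"
          "<book><isbn>2</isbn><title>B</title><price>20</price></book></books>")
FATBRAIN = ("<books><book><isbn>1</isbn><title>A</title><price>45</price></book>"
            "<book><isbn>2</isbn><title>B</title><price>25</price></book></books>")

def pricier_at_amazon(amazon_xml, fatbrain_xml):
    """Return (title, amazon price, fatbrain price) where amazon is more expensive."""
    # Build side of the hash join: isbn -> fatbrain price.
    fat_price = {
        b.findtext("isbn"): float(b.findtext("price"))
        for b in ET.fromstring(fatbrain_xml).findall("book")
    }
    rows = []
    for am in ET.fromstring(amazon_xml).findall("book"):   # probe side
        isbn = am.findtext("isbn")
        price = float(am.findtext("price"))
        if isbn in fat_price and price > fat_price[isbn]:
            rows.append((am.findtext("title"), price, fat_price[isbn]))
    return rows

pricier = pricier_at_amazon(AMAZON, FATBRAIN)
print(pricier)
```

The dictionary assumes one fatbrain entry per isbn; with duplicates, a list per key would be needed.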
108XML frenzy in the DB Community
- Now that XML is there, what can we do with it?
- Convert all databases from Relational to XML?
- Or provide XML views of relational databases?
- Develop theory of native XML databases?
- Or assume that XML data will be stored in relational databases..
- Issues: What sort of storage mechanisms? What sort of indices?
109XML middleware for Databases
RDBMS
On the internet, nobody needs to know that you
are a dog
- XML adapters (middleware) received significant attention in the DB community
- SilkRoute (AT&T)
- XPERANTO (IBM)
- Issues
- Need to convert relational data into XML
- Tagging (easy)
- Need to convert XQuery queries into equivalent SQL queries
- Trickier, as XQuery supports schema querying
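The "tagging" half really is easy, as the following sketch suggests. It is not how SilkRoute or XPERANTO work internally; it only wraps each row of a table in an element, with one child per column. The table, column names and data are made up, and an in-memory SQLite database stands in for the RDBMS.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table, row_tag):
    """Wrap each row of `table` in <row_tag>, one child element per column."""
    # The table name is interpolated directly, so it must come from trusted code.
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    cols = [d[0] for d in cur.description]
    root = ET.Element(table)
    for row in cur:
        elem = ET.SubElement(root, row_tag)
        for col, val in zip(cols, row):
            if val is not None:              # a NULL becomes a missing element
                ET.SubElement(elem, col).text = str(val)
    return root

# Hypothetical relational data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [("Book A", 30.0), ("Book B", None)])
xml_view = table_to_xml(conn, "books", "book")
print(ET.tostring(xml_view, encoding="unicode"))
```

Translating an arbitrary XQuery over this view back into SQL is the hard direction and is not attempted here.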
110IR Style Querying of XML Documents
111From Manning et al IR Text
An XML document is represented as a vector in the space of lexical trees.
The query is an extended lexical tree.
Similarity between the query and a lexical tree is defined as follows: [formula]
Within the document, you return the snippet that is closest..
Note that we are increasing the size of the index (lexical trees rather than just words) to exploit structure. This is normal (i.e., the index becomes larger when structure is present).
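The index-size remark can be made concrete with a simplified sketch. Instead of full lexical trees, it indexes (path, word) pairs, which is an assumption made here to keep the code short, and compares that index with a plain word index. The two documents are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: xml string}. Returns (word_index, structural_index)."""
    word_index = defaultdict(set)      # word -> doc ids
    struct_index = defaultdict(set)    # (path, word) -> doc ids

    def walk(elem, path, doc_id):
        path = path + "/" + elem.tag
        # Only text directly inside the element is indexed (tail text is ignored).
        for word in (elem.text or "").lower().split():
            word_index[word].add(doc_id)
            struct_index[(path, word)].add(doc_id)
        for child in elem:
            walk(child, path, doc_id)

    for doc_id, text in docs.items():
        walk(ET.fromstring(text), "", doc_id)
    return word_index, struct_index

# Hypothetical collection: the same words appear in different structural contexts.
DOCS = {
    "d1": "<book><title>web data</title><author>smith</author></book>",
    "d2": "<book><title>smith papers</title><author>web</author></book>",
}
words, structs = build_indexes(DOCS)
print(len(words), "word entries vs", len(structs), "structural entries")
```

A word query for "smith" matches both documents, while the structural term (/book/author, smith) matches only d1. That extra precision is what the larger index buys.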
112Semantic Web Standards: RDF/RDF-Schema/OWL
113Syntax vs. Semantics
- Syntax provides the grammar for a language (all you can do is see whether a sentence is grammatically correct and do part-of-speech tagging)
- XML
- Semantics provides the set of worlds where a particular sentence (or a set of sentences) holds
- Many formal languages have well-defined semantics (propositional logic, first-order logic, etc.)
- Semantic Web involves providing an XML syntax for representing description logics, a fragment of first-order logic
- Has two parts
- Base facts are represented by the RDF standard
- Background knowledge (axioms etc.) is represented by RDF-Schema (which is now superseded by OWL)
114XML isn't enough for Knowledge Exchange..
- XML is a universal metalanguage for defining markup
- It provides a uniform framework for interchange of data and metadata between applications
- However, XML does not provide any means of talking about the semantics (meaning) of data
- E.g., there is no intended meaning associated with the nesting of tags
- It is up to each application to interpret the nesting.
115Nesting of Tags in XML
- David Billington is a lecturer of Discrete Maths
<course name="Discrete Maths">
  <lecturer>David Billington</lecturer>
</course>
<lecturer name="David Billington">
  <teaches>Discrete Maths</teaches>
</lecturer>
- Opposite nesting, same information!
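The practical consequence is that each nesting needs its own application code to recover the same fact. In the sketch below, the two extractor functions and the predicate name "teaches" are illustrative choices, not part of any standard.

```python
import xml.etree.ElementTree as ET

BY_COURSE = '<course name="Discrete Maths"><lecturer>David Billington</lecturer></course>'
BY_LECTURER = '<lecturer name="David Billington"><teaches>Discrete Maths</teaches></lecturer>'

def fact_from_course(xml_text):
    """Nesting 1: the course is the outer element, the lecturer is nested."""
    course = ET.fromstring(xml_text)
    return (course.findtext("lecturer"), "teaches", course.get("name"))

def fact_from_lecturer(xml_text):
    """Nesting 2: the lecturer is the outer element, the course is nested."""
    lecturer = ET.fromstring(xml_text)
    return (lecturer.get("name"), "teaches", lecturer.findtext("teaches"))

fact_a = fact_from_course(BY_COURSE)
fact_b = fact_from_lecturer(BY_LECTURER)
print(fact_a == fact_b, fact_a)
```

Nothing in the XML itself says the two documents agree; that knowledge lives in the two hand-written functions.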
116What we want is a standard for representing
knowledge on the web..
- A standard technique for KR is logic
- So how about we find a way of encoding logical statements in XML?
- A logical theory consists of
- Base facts
- Background theory
- RDF is a standard for writing (binary predicate) base facts
- E.g. parent(Tom,Mary)
- RDF-Schema is a standard for writing background theory..
- E.g. ∀x,y Parent(x,y) ⇒ Loves(x,y)
- Recall that the complexity of inference depends on the form of the background theory (e.g., semi-decidable for general FOPC and polynomial for propositional Horn clauses). It is also tractable for description logics where all the background knowledge is of the form class, sub-class, instance. This is what RDF-Schema tries to capture.
- RQL is (an emerging?) standard for querying RDF/RDF-S databases
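The base facts / background theory split can be sketched with a tiny forward chainer. It handles only rules of the restricted shape body(x,y) ⇒ head(x,y), which is a simplification chosen for this example and far weaker than RDF-Schema or a description logic.

```python
def forward_chain(facts, rules):
    """facts: set of (predicate, x, y) base facts.
    rules: list of (body_pred, head_pred), each read as
           for all x, y: body_pred(x, y) => head_pred(x, y).
    Returns the base facts plus everything derivable from them."""
    derived = set(facts)
    changed = True
    while changed:                       # repeat until a fixpoint is reached
        changed = False
        for body, head in rules:
            for pred, x, y in list(derived):
                if pred == body and (head, x, y) not in derived:
                    derived.add((head, x, y))
                    changed = True
    return derived

base_facts = {("parent", "Tom", "Mary")}     # what RDF would state
background = [("parent", "loves")]           # what the background theory would state
theory = forward_chain(base_facts, background)
print(sorted(theory))
```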
117Basic Ideas of RDF
- Basic building block: object-attribute-value triple
- It is called a statement
- The sentence about Billington is such a statement
- RDF has been given a syntax in XML
- This syntax inherits the benefits of XML
- Other syntactic representations of RDF are possible
118Web Schema Languages
- Existing Web languages extended to facilitate content description
- XML → XML Schema (XMLS)
- RDF → RDF Schema (RDFS)
- XMLS is not an ontology language
- Changes format of DTDs (document schemas) to be XML
- Adds an extensible type hierarchy
- Integers, Strings, etc.
- Can define sub-types, e.g., positive integers
- RDFS is recognisable as an ontology language
- Classes and properties
- Sub/super-classes (and properties)
- Range and domain (of properties)
119RDF and RDFS
- RDF stands for Resource Description Framework
- It is a W3C candidate recommendation (http://www.w3.org/RDF)
- RDF is a graphical formalism (+ XML syntax + semantics)
- for representing metadata
- for describing the semantics of information in a machine-accessible way
- RDFS extends RDF with schema vocabulary, e.g.
- Class, Property
- type, subClassOf, subPropertyOf
- range, domain
120The RDF Data Model
- Statements are <subject, predicate, object> triples
- Can be represented using XML serialisation, e.g.
- <Ian,hasColleague,Uli>
- Statements describe properties of resources
- A resource is a URI representing a (class of) object(s)
- a document, a picture, a paragraph on the Web
- http://www.cs.man.ac.uk/index.html
- a book in the library, a real person (?)
- isbn://5031-4444-3333
- Properties themselves are also resources (URIs)
121URIs
- URI: Uniform Resource Identifier
- "The generic set of all names/addresses that are short strings that refer to resources"
- URIs may or may not be dereferenceable
- URLs (Uniform Resource Locators) are a particular type of URI, used for resources that can be accessed on the WWW (e.g., web pages)
- In RDF, URIs typically look like normal URLs, often with fragment identifiers to point at specific parts of a document
- http://www.somedomain.com/some/path/to/file#fragmentID
122RDF Syntax
- RDF has an XML syntax that has a specific meaning
- Every Description element describes a resource
- Every attribute or nested element inside a Description is a property of that Resource with an associated object resource
- Resources are referred to using URIs
<Description about="some.uri/person/ian_horrocks">
  <hasColleague resource="some.uri/person/uli_sattler"/>
</Description>
<Description about="some.uri/person/uli_sattler">
  <hasHomePage>http://www.cs.mam.ac.uk/sattler</hasHomePage>
</Description>
<Description about="some.uri/person/carole_goble">
  <hasColleague resource="some.uri/person/uli_sattler"/>
</Description>
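Reading the Description elements as triples is mechanical, as the sketch below suggests. It follows the simplified, namespace-free syntax used on the slide rather than full RDF/XML (which uses the rdf: namespace and more constructs), and the wrapping <rdf> root element is added here only so the three descriptions form one well-formed document.

```python
import xml.etree.ElementTree as ET

RDF_XML = """
<rdf>
  <Description about="some.uri/person/ian_horrocks">
    <hasColleague resource="some.uri/person/uli_sattler"/>
  </Description>
  <Description about="some.uri/person/uli_sattler">
    <hasHomePage>http://www.cs.mam.ac.uk/sattler</hasHomePage>
  </Description>
  <Description about="some.uri/person/carole_goble">
    <hasColleague resource="some.uri/person/uli_sattler"/>
  </Description>
</rdf>
"""

def to_triples(xml_text):
    """Turn each nested property element into a (subject, predicate, object) triple."""
    triples = []
    for desc in ET.fromstring(xml_text).findall("Description"):
        subject = desc.get("about")
        for prop in desc:
            # A resource attribute points at another resource;
            # otherwise the element text is taken as a literal value.
            obj = prop.get("resource") or (prop.text or "").strip()
            triples.append((subject, prop.tag, obj))
    return triples

triples = to_triples(RDF_XML)
for t in triples:
    print(t)
```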
123Linking Statements
- The subject of one statement can be the object of another
- Such collections of statements form a directed, labeled graph
- Note that the object of a triple can also be a literal (a string)
- Note also that RDF triples don't by themselves give meaning
- You know that (1) Ian and Carole are most likely colleagues (barring multiple jobs for Uli) and (2) (Uli hasColleague Ian) holds (colleagueness, unlike love, is symmetric). But DOES YOUR PROGRAM KNOW THIS?
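A program knows this only if someone tells it. In the sketch below, the plain triple store cannot answer the reversed question until it is given the extra piece of background knowledge that hasColleague is symmetric. The short resource names are abbreviations used for this example.

```python
def close_symmetric(triples, symmetric_props):
    """Add (o, p, s) for every (s, p, o) whose property is declared symmetric."""
    closed = set(triples)
    for s, p, o in triples:
        if p in symmetric_props:
            closed.add((o, p, s))
    return closed

graph = {
    ("ian", "hasColleague", "uli"),
    ("carole", "hasColleague", "uli"),
}

# Without background knowledge the reversed statement is simply absent.
plain = ("uli", "hasColleague", "ian") in graph

# With the declaration "hasColleague is symmetric" it can be derived.
informed = close_symmetric(graph, {"hasColleague"})
print(plain, ("uli", "hasColleague", "ian") in informed)
```

Concluding that Ian and Carole are colleagues would need yet another rule (for example, some form of transitivity), which is deliberately not assumed here.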
124A Critical View of RDF: Binary Predicates
- RDF uses only binary properties
- This is a restriction because often we use predicates with more than 2 arguments
- But binary predicates can simulate these
- Example: referee(X,Y,Z)
- X is the referee in a chess game between players Y and Z
125A Critical View of RDF: Binary Predicates (2)
- We introduce
- a new auxiliary resource chessGame
- the binary predicates ref, player1, and player2
- We can represent referee(X,Y,Z) as ref(chessGame,X), player1(chessGame,Y), player2(chessGame,Z)
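The encoding can be sketched as three triples that share the auxiliary chessGame resource. In the code below, the generated identifiers (chessGame1, chessGame2, ...) and the player names are made up; the predicates ref, player1 and player2 are the ones introduced above.

```python
import itertools

_game_ids = itertools.count(1)

def referee_as_triples(x, y, z):
    """Encode the ternary fact referee(x, y, z) as three binary statements."""
    game = f"chessGame{next(_game_ids)}"     # fresh auxiliary resource per game
    return [(game, "ref", x), (game, "player1", y), (game, "player2", z)]

def referee_from_triples(triples):
    """Recover the ternary facts by grouping statements on the auxiliary resource."""
    by_game = {}
    for game, pred, value in triples:
        by_game.setdefault(game, {})[pred] = value
    return [(g["ref"], g["player1"], g["player2"]) for g in by_game.values()]

triples = referee_as_triples("alice", "bob", "carol")
print(triples)
print(referee_from_triples(triples))
```

A fresh resource per game matters: reusing one chessGame resource for two games would mix up their referees and players.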
126A Critical View of RDF: Properties
- Properties are special kinds of resources
- Properties can be used as the object in an object-attribute-value triple (statement)
- They are defined independent of resources
- This possibility offers flexibility
- But it is unusual for modelling languages and OO programming languages
- It can be confusing for modellers
127A Critical View of RDF: Reification
- The reification mechanism is quite powerful
- It appears misplaced in a simple language like RDF
- Making statements about statements introduces a level of complexity that is not necessary for a basic layer of the Semantic Web
- Instead, it would have appeared more natural to include it in more powerful layers, which provide richer representational capabilities
128A Critical View of RDF: Summary
- RDF has its idiosyncrasies and is not an optimal modeling language, but
- It is already a de facto standard
- It has sufficient expressive power
- At least for more layers to build on top
- Using RDF offers the benefit that information maps unambiguously to a model
129RDF Schema (RDFS)
- RDF gives a formalism for metadata annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type
- Interpretation is an arbitrary binary relation
- I.e., <Person,subClassOf,Animal> has no special meaning
- RDF Schema defines schema vocabulary that supports definition of ontologies
- gives extra meaning to particular RDF predicates and resources (such as subClassOf)
- this extra meaning, or semantics, specifies how a term should be interpreted
NOTICE THAT RDF-SCHEMA is NOT to RDF WHAT XML-Schema is to XML
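What "extra meaning" buys can be sketched with two inference rules over triples. The code below is a toy closure, not a complete RDFS reasoner: it treats subClassOf as transitive and lets type propagate up the class hierarchy. The class and instance names are invented.

```python
def rdfs_closure(triples):
    """Apply two RDFS-style rules until nothing new is derived:
       (a subClassOf b), (b subClassOf c)  =>  (a subClassOf c)
       (x type a),       (a subClassOf b)  =>  (x type b)"""
    closed = set(triples)
    while True:
        sub = {(s, o) for s, p, o in closed if p == "subClassOf"}
        new = set()
        for s, p, o in closed:
            if p in ("subClassOf", "type"):
                for a, b in sub:
                    if o == a:
                        new.add((s, p, b))
        if new <= closed:                # fixpoint: no new statements
            return closed
        closed |= new

facts = {
    ("Student", "subClassOf", "Person"),
    ("Person", "subClassOf", "Animal"),
    ("ann", "type", "Student"),
}
inferred = rdfs_closure(facts)
print(sorted(inferred - facts))
```

In plain RDF the three input triples are just labeled edges; the derived statements appear only because the code hard-wires an interpretation of subClassOf and type.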
130Background Theory
RDF Schema is really RDF background knowledge!
Instances
131RDF/RDFS vs. General Knowledge Rep & Reasoning
- We noted that RDF can be seen as base-level facts and RDFS can be seen as background theory/facts/rules
- At this level, inference with RDF/RDFS seems to be just a special case of Knowledge Representation & Reasoning
- This is good (CSE471 Ahoy!) and bad (reasoning over most non-trivial logics is NP-hard or much, much worse).
- RDF/RDFS can be seen as an attempt to li