Title: Semi-Structured Data and XML
1. Semi-Structured Data and XML
2. Agenda
3. Semi-Structured Data: An Introduction
- What is structured data?
- What is non-structured data?
- What is semi-structured data?
- How is semi-structured data represented?
- What can we do with semi-structured data?
4. What is Structured Data?
- Strongly typed variables/attributes
  - (e.g. int, float, string[20])
- Every attribute in a relation is defined for all records
- Data is represented in some organized fashion
5. An Example of Structured Data
A relational database can be considered structured data.
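As a rough illustration of what "structured" means here, the sketch below stores two records in a relational table using Python's built-in sqlite3 module. The table and column names are hypothetical, and SQLite is used only because it ships with Python (it enforces declared types more loosely than most relational systems). The point to notice is that every record has every attribute, and an unknown value is stored as NULL.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id         INTEGER PRIMARY KEY,
        name       VARCHAR(20) NOT NULL,
        birth_year INTEGER,
        salary     REAL
    )
    """
)

# Every record carries every attribute; an unknown salary becomes NULL (None).
conn.executemany(
    "INSERT INTO person (id, name, birth_year, salary) VALUES (?, ?, ?, ?)",
    [(1, "Bob", 1949, 52000.0), (2, "Bill", 1967, None)],
)

rows = conn.execute(
    "SELECT id, name, birth_year, salary FROM person ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```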
6. What is Non-Structured Data?
- Data that has no type definitions
- Data is not organized according to any pattern
- No concept of variables or attributes
7. An Example of Non-Structured Data
Bob was born sometime in August of 1949. He has
a reasonable salary of 52000. Someone else was
born on the 12th of a different month, his name
is Bill. By the way, Bob was born on the 13th of
August.
As you can see, it would be almost impossible for a computer to parse such data automatically.
8. Then what is Semi-Structured Data?
- Anything in between structured and non-structured data!
9. Then what is Semi-Structured Data?
- Everything in between structured and non-structured data
- Variables are loosely typed
  - x=1 is valid, so is x="hello"
- A record does not need to have all attributes defined
  - e.g. in a database of cars, if we don't know the engine type, we can choose not to define the field for that particular record, whereas in a structured database the attribute would be defined but set to NULL.
- An attribute of a record could be another record
- It does not necessarily have to differentiate between an identifier and a value
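The properties above can be sketched with plain Python dictionaries. The car records and the helper `get_path` are hypothetical, made up to mirror the car-database example: one record stores the year as a number and the other as text, one record nests another record as its engine, and the other simply leaves the engine undefined.

```python
# Hypothetical car records, loosely following the car-database example.
cars = [
    {"make": "Ford", "year": 1998, "engine": {"type": "V6", "litres": 3.0}},
    {"make": "Honda", "year": "1995"},  # year is text here; engine is not defined
]

def get_path(record, *keys, default=None):
    """Follow a chain of keys through nested dicts; return default if any is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

for car in cars:
    engine_type = get_path(car, "engine", "type", default="(not defined)")
    print(car["make"], type(car["year"]).__name__, engine_type)
```

Code that reads such records has to tolerate missing fields, which is why the lookup takes a default instead of assuming the attribute exists.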
10. So how is semi-structured data represented?
- Semi-Structured data can be represented as a tree
11. So how is semi-structured data represented?
- Semi-Structured data can be represented in the form of indented text
Bob
    Birthday
        1949 August 13
    Salary
        52,000
Bill
    Birthday
        1967 April
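The tree form and the indented-text form describe the same nesting, so one can be produced from the other. The sketch below is one possible way to do it, assuming the tree is held as nested Python dictionaries; the function name `to_indented_text` and the exact grouping of the birthday values are assumptions, not part of the original slides.

```python
def to_indented_text(tree, indent="    ", depth=0):
    """Render a nested dict as indented lines; non-dict leaves are printed as values."""
    lines = []
    for label, child in tree.items():
        lines.append(indent * depth + str(label))
        if isinstance(child, dict):
            # An inner node: recurse one level deeper.
            lines.extend(to_indented_text(child, indent, depth + 1))
        else:
            # A leaf: print its value one level below its label.
            lines.append(indent * (depth + 1) + str(child))
    return lines

people = {
    "Bob": {"Birthday": "1949 August 13", "Salary": "52,000"},
    "Bill": {"Birthday": "1967 April"},
}

print("\n".join(to_indented_text(people)))
```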
12. So how is semi-structured data represented?
- Semi-Structured data can be represented as a markup language (e.g. HTML, XML, LISP, AceDB, Tsimmis)
<employee id="3">
  <name>Bob</name>
  <extension>5513</extension>
  <department>Sales</department>
  <salary>45000</salary>
</employee>
<employee id="1">
  <name>Ed</name>
  <extension>6766</extension>
  <office>312</office>
  <department>Executive</department>
  <salary>Confidential</salary>
</employee>
13. Overview
- Semi-Structured data is not necessarily created with the intention of being processed.
  - e.g. Web pages are not necessarily intended to be queried by a language like SQL; a web designer who does not take this into consideration may not make it easy for the data to be processed by a machine.
14. What can we do with Semi-Structured Data?
- Since there is some structure, it can be scanned and parsed
- Once the data is parsed, we can query it using specialized query languages such as UnQL, GEXT and Lorel
- We can clean it up to be placed into a structured relational database
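The parse, clean and load steps can be sketched with Python's standard library. This is only one possible approach: the `<staff>` wrapper element, the table layout, and the rule that a non-numeric salary becomes NULL are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The <staff> wrapper is an assumption, added so the fragment parses as one document.
XML_TEXT = """
<staff>
  <employee id="3">
    <name>Bob</name>
    <extension>5513</extension>
    <department>Sales</department>
    <salary>45000</salary>
  </employee>
  <employee id="1">
    <name>Ed</name>
    <extension>6766</extension>
    <office>312</office>
    <department>Executive</department>
    <salary>Confidential</salary>
  </employee>
</staff>
"""

def clean_salary(text):
    """Return the salary as an int, or None when it is missing or not numeric."""
    if text is None:
        return None
    text = text.strip().replace(",", "")
    return int(text) if text.isdigit() else None

def to_row(employee):
    """Flatten one <employee> element into a fixed-width tuple."""
    return (
        int(employee.get("id")),
        employee.findtext("name"),
        employee.findtext("office"),  # None when the field is absent
        employee.findtext("department"),
        clean_salary(employee.findtext("salary")),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee "
    "(id INTEGER PRIMARY KEY, name TEXT, office TEXT, department TEXT, salary INTEGER)"
)
root = ET.fromstring(XML_TEXT.strip())
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [to_row(e) for e in root.findall("employee")],
)
rows = conn.execute("SELECT * FROM employee ORDER BY id").fetchall()
print(rows)
```

Fields that a record never defined, and values that do not fit the column type, both end up as NULL in the relational table.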
15. XML: An Introduction
- What is XML?
- What does it offer to creators of DBs?
- How can XML be used as a DB?
- Representations of XML
- Other features of XML
- Disadvantages to XML
16. Summary / Key Points of Semi-Structured data
- In between structured and non-structured data
- Loosely typed attributes
- Not all attributes need to be defined for every record
- Can be parsed and queried
17. What is XML?
- XML stands for eXtensible Markup Language
- Based on tags similar to HTML
- Actually, XHTML is a form of XML
- Used to define markup languages
18. What does XML offer to database designers?
- Readable by humans using Unicode or ASCII text
- Easy for computers to parse
- Can easily be used as back-end for web sites
19. How can XML be used as a database?
Consider the following employee data, written in XML:
<employee id="3">
  <name>Bob</name>
  <extension>5513</extension>
  <department>Sales</department>
  <salary>45000</salary>
</employee>
<employee id="1">
  <name>Ed</name>
  <extension>6766</extension>
  <office>312</office>
  <department>Executive</department>
  <salary>Confidential</salary>
</employee>
Notice that this is semi-structured data, since not all the fields are filled in (Bob has no office) and because they are loosely typed (a salary can be a number or the text "Confidential").
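To give a feel for using such a document as a small database, the sketch below runs two simple lookups with Python's xml.etree.ElementTree. The `<staff>` root element and the helper `select` are hypothetical; the last line uses the limited XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# <staff> is a hypothetical root element wrapping the two employee records.
STAFF = ET.fromstring("""<staff>
  <employee id="3"><name>Bob</name><extension>5513</extension>
    <department>Sales</department><salary>45000</salary></employee>
  <employee id="1"><name>Ed</name><extension>6766</extension><office>312</office>
    <department>Executive</department><salary>Confidential</salary></employee>
</staff>""")

def select(root, where):
    """Return (id, name) for each employee whose child elements match `where`."""
    results = []
    for emp in root.iter("employee"):
        # findtext returns None for a missing field, so such records never match.
        if all(emp.findtext(field) == value for field, value in where.items()):
            results.append((emp.get("id"), emp.findtext("name")))
    return results

print(select(STAFF, {"department": "Sales"}))
print(select(STAFF, {"office": "312"}))

# The same kind of filter, written as an ElementTree XPath predicate.
executives = [e.findtext("name") for e in STAFF.findall("employee[department='Executive']")]
print(executives)
```

A record that lacks the queried field is skipped rather than causing an error, which is the behaviour one usually wants with semi-structured data.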
20. In XML, there are few restrictions on how data can be laid out
- The tag names can represent either attribute names or data itself
- Tag names can be defined as anything the creator wishes
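A small sketch of the two layout styles, using made-up fragments: in the first, tag names act as attribute names and the text holds the data; in the second, a tag name is itself a piece of data. Both parse equally well, but the code that reads them has to know which convention was used.

```python
import xml.etree.ElementTree as ET

# Layout A: tag names are attribute names, the element text holds the data.
layout_a = ET.fromstring("<person><name>Bob</name><salary>52000</salary></person>")

# Layout B: the tag name itself carries a piece of data (the person's name).
layout_b = ET.fromstring("<Bob><salary>52000</salary></Bob>")

name_a = layout_a.findtext("name")  # read the name from element text
name_b = layout_b.tag               # read the name from the tag itself
print(name_a, name_b)
```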
21. But, there are still a few restrictions
- Every tag that is opened must be closed.
  - <name>Bob</name>
- A separate close tag is not needed for empty data
  - <myelement/>
- If one tag is opened inside the field of another tag, it must be closed before the outer tag is closed.
  - Invalid: <employee><name>Bob</employee></name>
  - Valid: <employee><name>Bob</name></employee>
- Tags are case sensitive
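These rules can be checked mechanically: a conforming parser rejects any fragment that breaks them. The sketch below uses Python's xml.etree.ElementTree for this; the helper name `is_well_formed` is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as a single XML element, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

fragments = [
    "<name>Bob</name>",                       # opened and closed
    "<name>Bob",                              # never closed
    "<myelement/>",                           # empty element, no separate close tag
    "<employee><name>Bob</employee></name>",  # inner tag closed after outer tag
    "<employee><name>Bob</name></employee>",  # properly nested
    "<Name>Bob</name>",                       # case mismatch between open and close
]

for fragment in fragments:
    print(is_well_formed(fragment), fragment)
```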
22. How can XML be represented?
- As a tree structure
- As text/markup tags
23. How can XML be represented?
- As a tree structure
Take our previous example
- Leaf nodes generally, but not necessarily, store the data
- Recent web browsers will show this structure
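The tree view can also be produced in code. The sketch below walks a shortened version of the employee record with Python's xml.etree.ElementTree and prints one line per node; the function name `outline` is hypothetical. Note that the `id` value sits on the non-leaf `employee` node, which is one case where data is not stored in a leaf.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one line per node, indented by depth; any text is shown after the tag."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(
    "<employee id='3'><name>Bob</name><department>Sales</department></employee>"
)

print("\n".join(outline(root)))

# Leaves are the elements with no child elements.
leaves = [e.tag for e in root.iter() if len(e) == 0]
print(leaves, root.attrib)
```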
24. How can XML be represented?
- As a text/markup language
Take our previous example
<employee id="3">
  <name>Bob</name>
  <extension>5513</extension>
  <department>Sales</department>
  <salary>45000</salary>
</employee>
<employee id="1">
  <name>Ed</name>
  <extension>6766</extension>
  <office>312</office>
  <department>Executive</department>
  <salary>Confidential</salary>
</employee>
25. Other features of XML
- It is easy to parse
- It can be queried like a database
- It can be used with XSL Templates to easily generate web pages from data
- It can be used with a DTD (Document Type Definition) to run as a fully structured database
26. Disadvantages of XML
- Difficult to create indexes on
- Difficult to optimize queries
- Requires additional disk space
  - Text format
  - Redundant data in tags
- No single standard of how data should be stored in XML
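The disk-space point can be made concrete by encoding the same records two ways and comparing sizes. The sketch below is only a rough illustration: the records are taken from the employee example, the `<staff>` wrapper is an assumption, CSV is chosen arbitrarily as the compact comparison format, and the XML writer skips escaping because these values contain no special characters.

```python
import csv
import io

records = [
    {"name": "Bob", "extension": "5513", "department": "Sales", "salary": "45000"},
    {"name": "Ed", "extension": "6766", "department": "Executive", "salary": "Confidential"},
]

def to_xml(rows):
    """Serialize rows as XML; every field name is repeated in an open and a close tag."""
    parts = ["<staff>"]
    for row in rows:
        parts.append("<employee>")
        parts.extend(f"<{key}>{value}</{key}>" for key, value in row.items())
        parts.append("</employee>")
    parts.append("</staff>")
    return "".join(parts)

def to_csv(rows):
    """Serialize rows as CSV; field names appear once, in the header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

xml_size = len(to_xml(records).encode("utf-8"))
csv_size = len(to_csv(records).encode("utf-8"))
print("XML bytes:", xml_size, "CSV bytes:", csv_size)
```

The gap grows with the number of records, since the tag names are repeated for every record while the CSV header is written once.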
27. Summary / Key points of XML
- Data stored using text-based markup language
- Can also be represented in tree format
- Can store structured and semi-structured data
- Easy to parse and query, but inefficient
28. Where to Get More Information
- Search the web, you'll find something!