Data Warehousing/Mining Comp 150 DW Semistructured Data - PowerPoint PPT Presentation

About This Presentation

Title:

Data Warehousing/Mining Comp 150 DW Semistructured Data

Description:

Where in the database is the string 'Casablanca' to be found? ... 'Find whether 'Allen' acted in 'Casablanca' Need regular expresions to constrain paths ... – PowerPoint PPT presentation

Number of Views:122

Avg rating:3.0/5.0

Slides: 34

Provided by: csTu6

Learn more at: https://www.cs.tufts.edu

Category:

more less

Transcript and Presenter's Notes

Title: Data Warehousing/Mining Comp 150 DW Semistructured Data

1
Data Warehousing/MiningComp 150 DW
Semistructured Data

Instructor Dan Hebert

2
Semistructured Data

Everything that has no rigid schema
Schema is contained within the data
(self-describing), OR
No separate schema, OR
Schema exists but places only loose constraints
on data
Emerged as an important topic for a variety of
reasons
Many data sources like WWW which we would like to
treat as databases but cannot for the lack of
schema
Desirable to have an extremely flexible format
for data exchange between disparate databases
May want to view structured data as
semistructured data for the purpose of browsing

3
Motivation

Some data really is unstructured/semistructured
World Wide Web,
Data exchange formats
Some exotic database management systems, e.g.,
ACeDB, popular with biologists
Data integration
Browsing

4
Motivation - World Wide Web

Why do we want to treat the Web as a database?
To maintain integrity
To query based on structure (as opposed to
content)
To introduce some organization.
But the Web has no structure. The best we can say
is that it is an enormous graph.

5
Motivation - Data Formats

Much (probably most) of the worlds data is in
data formats
These are formats defined for the interchange and
archiving of data
Data formats vary in generality. ASN.1 and XDR
are quite general
Scientific data formats tend to be fixed
schemas
The textual representation given by data formats
is sometimes not immediately translatable into a
standard relational/object-oriented representation

6
Motivation - Data Integration

Goal is to integrate all types of information,
including unstructured information
Irregular, missing information, structure not
fully known, dynamic schema evolution, etc.
Traditional data models and languages not well
suited
Cannot accommodate heterogeneous data sets
(different types and structures), etc.
Difficult to build software that will easily
convert between two disparate models
OEM (Object Exchange Model)
Semistructured data model from TSIMMIS project at
Stanford
Internal data structure for exchange of data
between DBMSs
Used by other systems e.g., Windows 95 registry,
Lotus Notes

7
Motivation - Browsing

To query a database one needs to understand the
schema.
However schemas have opaque terminology and the
user may want to start by querying the data with
little or no knowledge of the schema.
Where in the database is the string Casablanca
to be found?
Are there integers in the database greater than
216 ?
What objects in the database have an attribute
name that starts with act?
While extensions to relational query languages
have been proposed for such queries, there is no
generic technique for interpreting them.

8
The Model

Represent data as some kind of graph-like or
tree-like model
Cycles are allowed but usually refer to them as
trees
Several different approaches with minor
differences (easy to convert)
Data on labels or edges, nodes carry information
or not
Straightforward to encode relational and
object-oriented databases
Issue object identity

9
Querying Semistructured Data

There are (at least) three approaches to this
problem
Add arbitrary features to SQL or to your favorite
query language
Find some principled approach to programs that
are based on the type of the data
Represent the graph (or whatever the structure
is) as appropriate predicates and use some
variety of datalog on that structure

10
The Extend SQL Approach

In fact it is an attempt to extend the philosophy
of OQL and comprehension syntax to these new
structures
It is the approach taken in the design of UnQL
and also of Lorel
Looks very similar to OQL (path expressions)

11
Example

select Entry.Movie.Title
from DB
where Entry.Movie.Director...

12
Syntax Issues

Need (path) variables to tie paths and edges
together
Paths of arbitrary length
Find all strings in db
Find whether Allen acted in Casablanca
Need regular expresions to constrain paths
Rich set of overloadings for operators to deal
with comparisons of objects with values and of
values with sets

13
Underlying Computational Strategy

Model graph as a relational database and use
relational query language.
Database large relation (node-id, label, node-id)
Used by Stanford group in LORE/LOREL
Complications
Labels are from heterogeneous set of types, need
more than one relation
Additional relations if info to be stored in
nodes
Various navigation issues

14
Semistructured Data - Case StudyObject Exchange
Model
15
OEM Features

Common model for heterogeneous information
exchange, self-describing
Each object

OID
Label
Type
Value

OID unique identifier or NULL
Label character string descriptor
Type atomic data type or set
Value atomic value or set of object references

Help pages for labels
Query language OEM-QL

16
Representing Semistructured Data Using OEM
Label
ltcollection, b1, a1, ...gt b1 ltbook, t, agt
t lttitle, Database and ...gt a
ltauthor, n, pgt n ltname, Jeff Ullmangt p
ltpicture, /gifs/ullman.gifgt a1 ltarticle, v,
w, xgt v ltauthor, Gio Wiederholdgt w lttitle,
Mediators in the gt x ltjournal, IEEE
Computergt
Set Value
Memory Addresses
Atomic Value
...
17
An OEM Query Language OEM-QL

Logic-based language for OEM
Match object patterns, generate variable
bindings, construct new OEM objects from existing
ones
Get articles published in IEEE Computer
P -
Pltarticles ltjournal IEEE Computergtgt
Get titles of books by Jeff Ullman
ltanswer_title Tgt -
ltbook ltauthor Jeff Ullmangt lttitle Tgtgt

18
Semistructured Data - Case StudyWWW Extraction
19
Problem

Lots of valuable information on the Web
irregular structure
highly dynamic
Embedded in HTML
Limited query facilities

20
Data Extraction Tool

Flexible, easy to use
Accommodate virtually any HTML source
Interface with existing system, e.g., data
warehouse, user interface for querying

Query
Data Warehouse
World Wide Web
Extractor
WH Integrator
Specification
21
Approach

Extract Web data into OEM format
Query using OEM-QL
Python-based, configurable parser
Declarative description of HTML source
location of data on page
how to package data into OEM
Regular expression-like syntax
Human intelligence rather than A.I.

22
Extractor Specification
Consists of commands of the form
23
HTML Source File
ltHTMLgt ltHEADgt . . . ltTABLEgt ltTRgt ltTHgtltIgt
header 1 lt/Igtlt/THgt ltTHgtltIgt header 2
lt/Igtlt/THgt ltTHgtltIgt header 3 lt/Igtlt/THgt lt/TRgt
ltTRgt ltTDgt text 1 lt/TDgt ltTDgtltA
HREFhttp//www.stuff/gt text 2 lt/Agtlt/TDgt
ltTDgt text 3 lt/TDgt lt/TRgt . .
. lt/TABLEgt . . . lt/BODYgt lt/HTMLgt
24
Specification File
root, get('http//www.example.test/'),
, __tempvar1, root,
lttablegtlt/tablegt , __tempvar2, split
(__tempvar1,lt/trgt), , rows,
__tempvar21-1, , header1,header2_url
,header2,header3, rows,
lttdgtlt/tdgtltahrefgtlt/agtlttdgtlt/tdgt
25
Result OEM Object
26
Basic SyntaxVariable

variable(lpt)
optional parameters for specification of
corresponding OEM object
l label name
t type
p parent object
_variable
temporary data structure, does not appear as OEM
object

27
Basic Syntax Source

split(variable,token)
creates a list with multiple elements using token
as the element separator
get(URL)
obtain contents of HTML file at address URL

28
Basic Syntax Patterns

token1 token2
match and store current input (between tokens)
token1 token2
match, dont store current input (between tokens)

29
Syntactic Sugar

Functions for extracting commonly used HTML
constructs
extract_table(variable),pattern
split_table_row(variable)
split_table_column(variable)
extract_list(variable),pattern
split_list(variables)

30
Advanced Features

Customization of output
structure, label names, data type, ...
Extraction across multiple HTML pages
Graceful recovery from parse errors
resume parsing using next input from source
Multiple patterns in single command
follow different parse tree depending on
structure in source

31
Sample Extraction Scenario
. . .
32
Extracted OEM Data
OEM-QL query ltcity C lthigh Hgt lt low Lgtgt -
lttemperature ltcity_temp ltcountry
Germanygt ltcity Cgt lthigh_today Hgt ltlow_today
Lgtgtgt
33
Evaluation