Title: Data Warehousing/Mining Comp 150 DW Semistructured Data
1Data Warehousing/MiningComp 150 DW
Semistructured Data
2Semistructured Data
- Everything that has no rigid schema
- Schema is contained within the data
(self-describing), OR - No separate schema, OR
- Schema exists but places only loose constraints
on data - Emerged as an important topic for a variety of
reasons - Many data sources like WWW which we would like to
treat as databases but cannot for the lack of
schema - Desirable to have an extremely flexible format
for data exchange between disparate databases - May want to view structured data as
semistructured data for the purpose of browsing
3Motivation
- Some data really is unstructured/semistructured
- World Wide Web,
- Data exchange formats
- Some exotic database management systems, e.g.,
ACeDB, popular with biologists - Data integration
- Browsing
4Motivation - World Wide Web
- Why do we want to treat the Web as a database?
- To maintain integrity
- To query based on structure (as opposed to
content) - To introduce some organization.
- But the Web has no structure. The best we can say
is that it is an enormous graph.
5Motivation - Data Formats
- Much (probably most) of the worlds data is in
data formats - These are formats defined for the interchange and
archiving of data - Data formats vary in generality. ASN.1 and XDR
are quite general - Scientific data formats tend to be fixed
schemas - The textual representation given by data formats
is sometimes not immediately translatable into a
standard relational/object-oriented representation
6Motivation - Data Integration
- Goal is to integrate all types of information,
including unstructured information - Irregular, missing information, structure not
fully known, dynamic schema evolution, etc. - Traditional data models and languages not well
suited - Cannot accommodate heterogeneous data sets
(different types and structures), etc. - Difficult to build software that will easily
convert between two disparate models - OEM (Object Exchange Model)
- Semistructured data model from TSIMMIS project at
Stanford - Internal data structure for exchange of data
between DBMSs - Used by other systems e.g., Windows 95 registry,
Lotus Notes
7Motivation - Browsing
- To query a database one needs to understand the
schema. - However schemas have opaque terminology and the
user may want to start by querying the data with
little or no knowledge of the schema. - Where in the database is the string Casablanca
to be found? - Are there integers in the database greater than
216 ? - What objects in the database have an attribute
name that starts with act? - While extensions to relational query languages
have been proposed for such queries, there is no
generic technique for interpreting them.
8The Model
- Represent data as some kind of graph-like or
tree-like model - Cycles are allowed but usually refer to them as
trees - Several different approaches with minor
differences (easy to convert) - Data on labels or edges, nodes carry information
or not - Straightforward to encode relational and
object-oriented databases - Issue object identity
9Querying Semistructured Data
- There are (at least) three approaches to this
problem - Add arbitrary features to SQL or to your favorite
query language - Find some principled approach to programs that
are based on the type of the data - Represent the graph (or whatever the structure
is) as appropriate predicates and use some
variety of datalog on that structure
10The Extend SQL Approach
- In fact it is an attempt to extend the philosophy
of OQL and comprehension syntax to these new
structures - It is the approach taken in the design of UnQL
and also of Lorel - Looks very similar to OQL (path expressions)
11Example
- select Entry.Movie.Title
- from DB
- where Entry.Movie.Director...
12Syntax Issues
- Need (path) variables to tie paths and edges
together - Paths of arbitrary length
- Find all strings in db
- Find whether Allen acted in Casablanca
- Need regular expresions to constrain paths
- Rich set of overloadings for operators to deal
with comparisons of objects with values and of
values with sets
13Underlying Computational Strategy
- Model graph as a relational database and use
relational query language. - Database large relation (node-id, label, node-id)
- Used by Stanford group in LORE/LOREL
- Complications
- Labels are from heterogeneous set of types, need
more than one relation - Additional relations if info to be stored in
nodes - Various navigation issues
14Semistructured Data - Case StudyObject Exchange
Model
15OEM Features
- Common model for heterogeneous information
exchange, self-describing - Each object
OID
Label
Type
Value
- OID unique identifier or NULL
- Label character string descriptor
- Type atomic data type or set
- Value atomic value or set of object references
- Help pages for labels
- Query language OEM-QL
16Representing Semistructured Data Using OEM
Label
ltcollection, b1, a1, ...gt b1 ltbook, t, agt
t lttitle, Database and ...gt a
ltauthor, n, pgt n ltname, Jeff Ullmangt p
ltpicture, /gifs/ullman.gifgt a1 ltarticle, v,
w, xgt v ltauthor, Gio Wiederholdgt w lttitle,
Mediators in the gt x ltjournal, IEEE
Computergt
Set Value
Memory Addresses
Atomic Value
...
17An OEM Query Language OEM-QL
- Logic-based language for OEM
- Match object patterns, generate variable
bindings, construct new OEM objects from existing
ones - Get articles published in IEEE Computer
- P -
- Pltarticles ltjournal IEEE Computergtgt
- Get titles of books by Jeff Ullman
- ltanswer_title Tgt -
- ltbook ltauthor Jeff Ullmangt lttitle Tgtgt
18Semistructured Data - Case StudyWWW Extraction
19Problem
- Lots of valuable information on the Web
- irregular structure
- highly dynamic
- Embedded in HTML
- Limited query facilities
20Data Extraction Tool
- Flexible, easy to use
- Accommodate virtually any HTML source
- Interface with existing system, e.g., data
warehouse, user interface for querying
Query
Data Warehouse
World Wide Web
Extractor
WH Integrator
Specification
21Approach
- Extract Web data into OEM format
- Query using OEM-QL
- Python-based, configurable parser
- Declarative description of HTML source
- location of data on page
- how to package data into OEM
- Regular expression-like syntax
- Human intelligence rather than A.I.
22Extractor Specification
Consists of commands of the form
23HTML Source File
ltHTMLgt ltHEADgt . . . ltTABLEgt ltTRgt ltTHgtltIgt
header 1 lt/Igtlt/THgt ltTHgtltIgt header 2
lt/Igtlt/THgt ltTHgtltIgt header 3 lt/Igtlt/THgt lt/TRgt
ltTRgt ltTDgt text 1 lt/TDgt ltTDgtltA
HREFhttp//www.stuff/gt text 2 lt/Agtlt/TDgt
ltTDgt text 3 lt/TDgt lt/TRgt . .
. lt/TABLEgt . . . lt/BODYgt lt/HTMLgt
24Specification File
root, get('http//www.example.test/'),
, __tempvar1, root,
lttablegtlt/tablegt , __tempvar2, split
(__tempvar1,lt/trgt), , rows,
__tempvar21-1, , header1,header2_url
,header2,header3, rows,
lttdgtlt/tdgtltahrefgtlt/agtlttdgtlt/tdgt
25Result OEM Object
26Basic SyntaxVariable
- variable(lpt)
- optional parameters for specification of
corresponding OEM object - l label name
- t type
- p parent object
- _variable
- temporary data structure, does not appear as OEM
object
27Basic Syntax Source
- split(variable,token)
- creates a list with multiple elements using token
as the element separator - get(URL)
- obtain contents of HTML file at address URL
28Basic Syntax Patterns
- token1 token2
- match and store current input (between tokens)
- token1 token2
- match, dont store current input (between tokens)
29Syntactic Sugar
- Functions for extracting commonly used HTML
constructs - extract_table(variable),pattern
- split_table_row(variable)
- split_table_column(variable)
- extract_list(variable),pattern
- split_list(variables)
30Advanced Features
- Customization of output
- structure, label names, data type, ...
- Extraction across multiple HTML pages
- Graceful recovery from parse errors
- resume parsing using next input from source
- Multiple patterns in single command
- follow different parse tree depending on
structure in source
31Sample Extraction Scenario
. . .
32Extracted OEM Data
OEM-QL query ltcity C lthigh Hgt lt low Lgtgt -
lttemperature ltcity_temp ltcountry
Germanygt ltcity Cgt lthigh_today Hgt ltlow_today
Lgtgtgt
33Evaluation
- Better than
- writing programs
- YACC, PERL, etc.
- A.I.
- Can do better
- GUI tool to simplify the generation of extractor
specification - Machine learning or data mining techniques to
automatically infer structure...