Detecting Changes to Data and Schema in Semistructured Documents

Slides: 38

Provided by: drnabi

1
Detecting Changes to Data and Schema in Semi-structured Documents

Igg Adiwijaya
CIMIC
Rutgers University
Committee members
Dr. Nabil Adam, Rutgers
Dr. Vijay Atluri, Rutgers
Dr. James Geller, NJIT
Dr. Ron Musick, LLNL
Dr. Yelena Yesha, UMBC

2
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

3
Motivation

Increasing need to detect changes to
semi-structured documents
Internet - data are stored in semi-structured
documents
Data warehousing
Data at underlying sources are unstructured,
semi-structured and structured
Extraction components are developed to extract
data from the sources
Changes to data and schema
Internet
Users need to be notified upon changes to web
pages of interest
Data warehousing
Data changes must be detected and propagated to
the warehouse for warehouse update
Schema changes must be detected and used to
update the extraction components to conform to
the new schema. Otherwise they fail to function
Practice
Internet - manual change detection or reliance on a "What's New" page
Data warehousing
Changes to data - refreshing from scratch (expensive) or incremental update
Changes to schema - handled manually (unacceptable when changes are frequent)

4
Problem Statements

How to deal with changes to data and schema when the information sources are autonomous
Little cooperation is provided by the sources
Changes to schema are frequent
E.g., 2 to 3 times per year in scientific environments
A manual approach is unacceptable

5
Research Objectives

A model for semi-structured documents
Capable of representing both data and schema
Efficient approach to detecting changes to data
New documents must be parsed anyway, so detection is performed during parsing
Semi-automatic approaches to detecting changes to schema

6
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

7
Related Work: Changes to Data

File modification date
Checksum - for CGI-generated documents
Longest Common Subsequence
Detects differences between strings
UNIX diff
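
The checksum and Longest Common Subsequence techniques listed above can be sketched as follows. This is a minimal illustration and not code from the presentation: the SHA-256 hash, the word-level tokenization, and the sample strings are all assumptions made for the example.

```python
import hashlib


def checksum(text: str) -> str:
    """Digest of the document text (the hash function is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


# Hypothetical old and new versions of one line of a document.
old = "salinity pH nitrates temperature".split()
new = "salinity dissolved-oxygen pH temperature".split()

# A checksum only says that something changed, not what changed.
changed = checksum(" ".join(old)) != checksum(" ".join(new))

# The LCS tells how many tokens survived in order; the rest were edited.
common = lcs_length(old, new)
```

The checksum comparison is cheap but coarse, while the LCS table costs time proportional to the product of the two lengths, which is the kind of expense the later slides try to avoid.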

8
Related Work: Changes to Data (contd)

Ball et al. of AT&T ref
Views semi-structured documents as containing
Sentences
Sentence-breaking markups - e.g., <P>, <HR>, <LI>, <H1>
Non-sentence-breaking markups - e.g., <IMG> and <A>
Change detection approach
Compare two versions of the same semi-structured document
Match corresponding sentence-breaking markups
Match corresponding sentences
Graphical display of the differences
Show the two versions side by side
Show the differences only
Merge the two pages
Limitation
Expensive - each sentence in the new version may be compared with all sentences in the old version

9
Related Work: Changes to Data (contd)

Chawathe et al. of Stanford ref
Views the content of a semi-structured document as a tree
Node - sentence-breaking markup
Edge - hierarchical relationship between nodes
Change detection approach
Generate the corresponding tree for successive versions of the same document
Transform the new tree to exactly match the old version
Operations: addition, deletion, reordering, and moving of sub-trees
Select the operation sequence with the least total cost
Limitation - expensive

[Figure: two versions of a tree with numbered nodes (1-11), annotated with Move, Addition, and Deletion operations]

10
Related Work: Changes to Data (contd)

URL Minder ref
Approach
Detects changes in a portion of interest, which users delimit by supplying the word(s) that precede it and the word(s) that follow it
Limitations
User-supplied phrases may be deleted or updated
The same phrase may appear elsewhere in the document
E.g.,

Although ..... The Consortium uses the Hackensack
River, its tributaries and associated estuarine
salt marshes as a living laboratory for middle
and high school students. The Consortium program
is interdisciplinary and investigates
the environmental, cultural and historical
impacts on the water quality of the Hackensack
River Watershed. Participating teachers attend
training workshops on the use and methodology of
the water quality kits. Consortium teachers
participate in workshops that provide cultural,
historical and environmental background materials
on the Hackensack River Watershed and the
surrounding New York / New Jersey Harbor Complex
including ecological status and trends reports,
field manuals and local histories. Students and
teachers collect data from sites located along
the Hackensack River Watershed. Although this
data includes measurements such as salinity,
dissolved oxygen, pH, nitrates, temperature,
water velocity and turbidity. The
HMDC Environmental Operations Research Laboratory
provides technical and advisory support. This is
the end of HMDC sample paragraph.
[Figure label: Portion of interest]

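
The phrase-delimited approach and its second limitation can be illustrated with a short sketch. It is not URL Minder's actual implementation: the function name `extract_between`, the matching rule (first occurrence of each phrase), and the shortened sample text are assumptions.

```python
from typing import Optional


def extract_between(text: str, before: str, after: str) -> Optional[str]:
    """Return the text between the first `before` phrase and the next `after` phrase."""
    start = text.find(before)
    if start == -1:
        return None  # the user-supplied preceding phrase was deleted or updated
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None  # the user-supplied following phrase was deleted or updated
    return text[start:end].strip()


# Hypothetical page text in which the delimiting word "Although" occurs twice.
page = (
    "Although the river is monitored. "
    "Although this data includes salinity and pH. "
    "The lab provides support."
)

# The user wants the second sentence, but the first "Although" matches instead.
portion = extract_between(page, "Although", ".")
```

Here `portion` holds text from the first sentence, not the intended one, which is the duplicate-phrase problem noted in the limitations.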
11
Related Work - Changes to Schema

Nestorov ref and Abiteboul ref
Use an edge-labeled graph to model the schema of semi-structured documents
Store semantic information by labeling the edges
Limitations
Labeling the edges is performed manually
Not all characteristics of the schema of semi-structured documents are represented
E.g., no notion of ordering, of mandatory/optional nodes, etc.

12
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

13
Overview of our approach

For detecting data and schema changes

[Figure: two flow diagrams. First diagram labels: New version, Schema, Old version, Generate Document Graph, Document graph, Generate Schema Graph, Schema graph, Parse using Value-added Graph, Value-added graph, Merge. Second diagram labels: New doc, Schema, Generate Schema Graph, Schema graph, Parse using Schema Graph.]

14
Semi-structured Documents

Characteristics
Irregular structure - different representations, missing structure
Implicit structure - different interpretations, parsing required
Indicative structure - no strict typing policy
Human data entry errors may be introduced
A semi-structured document consists of
Data objects
Mandatory objects - must exist in the documents
Optional objects - may or may not exist in the
documents
Relationship
Mandatory relationships - if one data object
exists, the other must also exist
Optional relationships - If one data object
exists, the other does not need to exist
Ordering
Significant ordering - proper ordering among a
set of data objects must be preserved
Insignificant ordering - no strict ordering is
preserved for a set of data objects
Repeatability
Non-repeatable object - 0 or 1 occurrence of an
object may exist
Repeatable object - 2 or more occurrences of an
object may exist

15
Model

We use a directed graph to model data and schema
Consisting of nodes and directed edges
Representing schema for semi-structured documents
Data objects
Mandatory objects - regular node
Optional objects - optional node
Relationship
Mandatory relationships - a single directed edge
exists between two nodes
Optional relationships - at least one optional
node exists between the two nodes
Ordering
Significant ordering - solid edges are used to
connect parent and its children
Insignificant ordering - dashed edges are used to
connect parent and its children
Repeatability
Non-repeatable object - no looping directed edge
(repeatable edge) is used
Repeatable object - a repeatable edge is used
(A stopping node is used as a no-op for an optional node and as a terminator for a repeatable node)
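
One way to picture the schema graph is as a small data structure. The sketch below is only an assumption about how it could be encoded: the class name `SchemaNode`, its field names, and the choice to store edge properties (ordering, repeatability) as flags on nodes are all hypothetical simplifications, and the stopping node is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """One data object in a schema graph (hypothetical encoding)."""
    name: str
    optional: bool = False    # optional node vs. regular (mandatory) node
    repeatable: bool = False  # True if a repeatable (looping) edge is attached
    ordered: bool = True      # True: solid edges to children; False: dashed edges
    children: list["SchemaNode"] = field(default_factory=list)


def mandatory_names(node: SchemaNode) -> list[str]:
    """Names reachable from `node` by following regular (mandatory) nodes only."""
    names = [node.name]
    for child in node.children:
        if not child.optional:
            names.extend(mandatory_names(child))
    return names


# Hypothetical schema: a has children b (mandatory), c (optional, repeatable), d (mandatory).
schema = SchemaNode(
    "a",
    children=[
        SchemaNode("b"),
        SchemaNode("c", optional=True, repeatable=True),
        SchemaNode("d"),
    ],
)

required = mandatory_names(schema)
```

Every document that conforms to this hypothetical schema must contain the objects in `required`; the optional node `c` may be absent or may repeat.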

[Figure: example schema graph with nodes labeled a, b, c, d, e, and s]

16
Model: Schema Graph (contd)

E.g., simplified schema graph for PDB documents

17
Model: Document Graph

View the content of a document as an instance of the schema
E.g.,

[Figure: content of a scientific document and its document graph representation]

18
Model: Value-Added Schema Graph

Instantiated from an older version of the document
To construct
Parse the older version using the schema graph
Copy the value of each object to the corresponding node on the schema graph
Extend optional nodes as necessary to handle loops
E.g.,

[Figure: value-added schema graph]

19
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

20
Detecting Changes to Data

Algorithm
Find the corresponding matching nodes in the document and the value-added schema graph while parsing
If the matching nodes are
ordered nodes,
compare their values -> value update
unordered nodes,
compare their values -> value update
compare their positions with respect to their siblings -> reorder
non-repeatable optional nodes,
compare their identifiers -> deletion/insertion
for equal identifiers, compare their values -> value update
repeatable optional nodes, create two lists starting from the matching sub-root
sequentially compare the two lists
for matching nodes, compare their values -> value update
for matching nodes, identify their locations in the lists -> move
for non-matching nodes in the schema graph/document -> deletion/insertion
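
The list comparison for repeatable optional nodes might look roughly like the sketch below. It is an illustration under assumptions, not the presented algorithm: each node is reduced to an (identifier, value) pair, identifiers are assumed unique within a list, and a "move" is flagged whenever a node's rank among the matched nodes differs between the two lists, so a single real move can also shift the rank of its neighbors.

```python
def diff_repeatable(old, new):
    """Compare two lists of (identifier, value) pairs and report change events."""
    old_val = dict(old)
    new_ids = {ident for ident, _ in new}
    # Rank of each matched identifier among the matched nodes of the old list.
    matched_old = [ident for ident, _ in old if ident in new_ids]
    old_rank = {ident: rank for rank, ident in enumerate(matched_old)}

    events = []
    rank = 0
    for ident, value in new:
        if ident not in old_val:
            events.append(("insertion", ident))  # only in the new document
            continue
        if old_val[ident] != value:
            events.append(("value update", ident))
        if old_rank[ident] != rank:
            events.append(("move", ident))
        rank += 1
    for ident, _ in old:
        if ident not in new_ids:
            events.append(("deletion", ident))  # only in the value-added graph
    return events


# Hypothetical node lists from the value-added schema graph (old) and the new document.
old_nodes = [("A", "1"), ("B", "2"), ("C", "3"), ("D", "4")]
new_nodes = [("A", "9"), ("C", "3"), ("B", "2"), ("E", "5")]

events = diff_repeatable(old_nodes, new_nodes)
```

Each list is traversed once with dictionary lookups, so the comparison stays linear in the list lengths.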

21
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

22
Detecting Changes to Schema

Types of schema changes
Reordering of a set of nodes in the schema graph
Insertion of a new node
Deletion of an existing node
Renaming existing nodes
Addition of a new repeatable edge
Deletion of an existing repeatable edge
Conversion of significant into insignificant
ordering
Conversion of insignificant into significant
ordering
Two cases - to simplify the problem
No human errors during data entry
Human errors during data entry - may be misinterpreted as schema changes
Approaches
Periodic - transforming current schema into new
schema after evaluating a set of documents
Immediate - progressively transforming current
schema after evaluating each new document

23
Periodic Vs Immediate
24
Periodic Approach

Algorithm
While parsing, for each node in the document
The matching node is located in the value-added schema graph
If the node is
a regular node, its children are compared
Partially matching nodes -> rename
Unmatched nodes on the value-added graph -> deletion
Unmatched nodes in the document -> insertion
an optional node, the children of the node in the schema are compared with the corresponding sub-tree in the document
Nodes never discovered before -> new/renamed nodes
Nodes appearing more than once -> repeatable nodes
Nodes appearing before the last node of the sub-tree -> repeatable nodes
Nodes appearing only at the end of the sub-tree -> non-repeatable nodes

25
Immediate Approach

Detect schema changes based on discovered events
Events - discovery of a new node, renaming of an existing node, different ordering, missing mandatory node, appearance of a mandatory node, missing optional node, appearance of an optional node, single appearance of a repeatable node, multiple appearances of a repeatable node, multiple appearances of a non-repeatable node
Conclusions
Terminal conclusion - certain conclusion, relevant conclusion
Incomplete conclusion - insufficient events to derive a final conclusion
More new documents are needed to convert an incomplete conclusion into a terminal conclusion
E.g., event - discovery of a new node

26
Complete event-decision diagram
27
Incorporating Data Entry Errors

Types of Errors
Node
A mandatory node is mistakenly excluded from the new document
A mandatory node is mistakenly renamed with a different identifier
A removed mandatory node mistakenly reappears
An optional node, which is supposed to appear, is mistakenly excluded
An optional node, which is supposed to be excluded, appears
Ordering
A significant ordering is mistakenly reordered
A previously removed ordering reappears
Repeatable edge
A non-repeatable node mistakenly appears more
than once

28
Incorporating Data Entry Errors (contd)

Periodic approach
After a sufficiently large collection of documents has been evaluated for schema changes, a straightforward statistical measurement can be used to incorporate errors
E.g., a mandatory node
Stays mandatory if it appears in 90% or more of the new documents
10% is used to incorporate errors
Has been converted into optional if it appears in 30-70% of the new documents
30% is used to incorporate errors
Has been removed from the new schema if it appears in <3% of the new documents
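
The threshold rule for a mandatory node can be written as a small function. This sketch reads the slide's figures as percentages of the new documents; the function name and the "undetermined" label for shares that fall between the stated ranges are assumptions, since the slide does not say how those gaps are handled.

```python
def classify_mandatory_node(appearances: int, total_docs: int) -> str:
    """Classify a formerly mandatory node from how often it appears in new documents."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    share = appearances / total_docs
    if share >= 0.90:
        return "mandatory"     # up to 10% absences are attributed to entry errors
    if 0.30 <= share <= 0.70:
        return "optional"      # the node has been converted into an optional node
    if share < 0.03:
        return "removed"       # rare appearances are attributed to entry errors
    return "undetermined"      # assumed label: the slide leaves these gaps open


# Hypothetical counts over 100 newly evaluated documents.
results = [classify_mandatory_node(n, 100) for n in (95, 50, 1, 80)]
```

With these counts, `results` covers all four outcomes, including the 80% case that falls in the gap between the "optional" and "mandatory" ranges.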

29
Incorporating Data Entry Errors (contd)

Immediate approach
Adopt the Bayesian Sequential Analysis Procedure to incorporate potential errors
Decision criteria - posterior probability or utility value
Parameters
Different possible actions (terminal decisions that can be taken)
Different states of the world
Prior probability value
Likelihood probability
Utility value
Cost of evaluating a new document
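
A posterior-based version of the sequential procedure can be sketched as below. Every number and name is a hypothetical assumption for illustration: two states of the world ("mandatory" and "optional"), a uniform prior, made-up likelihoods of the node being present, and a posterior threshold as the stopping rule. Utility values and the cost of evaluating a new document are left out of this simplified sketch.

```python
def posterior(prior, likelihood):
    """One Bayes update: posterior(state) is proportional to prior(state) * likelihood(state)."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: p / total for state, p in unnormalized.items()}


def sequential_decision(prior, p_present, observations, threshold):
    """Update the belief after each document; stop once one state is probable enough."""
    belief = dict(prior)
    for docs_used, present in enumerate(observations, start=1):
        like = {s: (p if present else 1.0 - p) for s, p in p_present.items()}
        belief = posterior(belief, like)
        best = max(belief, key=belief.get)
        if belief[best] >= threshold:
            return best, docs_used, belief  # terminal decision
    return None, len(observations), belief  # incomplete: more documents are needed


# Hypothetical setup: is the node still mandatory, or has it become optional?
prior = {"mandatory": 0.5, "optional": 0.5}
# Assumed chance that the node appears in a document under each state
# (a mandatory node is missing 5% of the time because of entry errors).
p_present = {"mandatory": 0.95, "optional": 0.5}

# The node is missing from two consecutive new documents.
decision, docs_used, belief = sequential_decision(
    prior, p_present, observations=[False, False], threshold=0.95
)
```

After the first missing occurrence the posterior for "optional" is about 0.91, which is below the assumed threshold, so the conclusion stays incomplete; the second document raises it to about 0.99 and the procedure stops.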

30
Posterior computation for evaluating two docs
31
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

32
Implementation

DataFoundry at LLNL
Objective - to provide LLNL scientists with a uniform and semantically consistent repository for a variety of biological data coming from several information sources
Environment
Distributed, autonomous, heterogeneous information sources
There are semantic and syntactic conflicts among information sources
The collection of biological documents is large
Changes to schema are frequent
To include 50 sources 1 schema change per year
on the average!
Implementation details
Implemented a web crawler/ingest engine using Perl, Unix scripts, and Internet tools
Periodically visits genome sites and detects changes based on modification dates
Implemented data change detection - Lex/Yacc, C/C++
In progress - implementing schema change detection

33
Implementation (contd)

Change management system for LLNL
Architecture

34
Outline

Motivation
Related work
A model for semi-structured documents
Approach for detecting changes to data
Approaches for detecting changes to schema
Periodic, immediate and incorporating errors
Implementation - LLNL DataFoundry
Research contributions and future directions

35
Research Contributions

Developed a model for semi-structured documents
Used the model to detect
Changes to data, efficiently, while parsing the new document
Detection is performed during parsing, i.e., the cost of change detection is partly absorbed by the cost of parsing
Changes to schema
Identified the types of schema changes
Proposed periodic and immediate approaches to
detecting schema changes
Incorporate potential human errors during data
entry

36
Future Direction

Performance study
Accuracy of constructing a new schema that is significantly different from the current schema
For some schema changes
The current schema may remain functional in terms of parsing
Such changes are undetectable, and a different approach may be required
For now, a domain expert is used to detect them
Example

37
Future Direction (contd)

Determining whether a change is to data, schema, or both
Currently we determine that the schema has changed by
monitoring the file modification date of the free-formatted schema file made publicly available by the sources
waiting for the current parser to fail
Some schema changes do not cause the current parser to fail
Utilizing other statistical tools to detect changes to schema in the presence of errors
Making use of statistical distributions
Capable of detecting schema changes that cannot be automatically detected using the Bayesian Sequential Procedure