Title: Detecting Changes to Data and Schema in Semistructured Documents
1Detecting Changes to Data and Schema in Semi-structured Documents
- Igg Adiwijaya
- CIMIC
- Rutgers University
- Committee members
- Dr. Nabil Adam, Rutgers
- Dr. Vijay Atluri, Rutgers
- Dr. James Geller, NJIT
- Dr. Ron Musick, LLNL
- Dr. Yelena Yesha, UMBC
2Outline
- Motivation
- Related work
- A model for semi-structured documents
- Approach for detecting changes to data
- Approaches for detecting changes to schema
- Periodic, immediate and incorporating errors
- Implementation - LLNL DataFoundry
- Research contributions and future directions
3Motivation
- Increasing need to detect changes to semi-structured documents
  - Internet - data are stored in semi-structured documents
  - Data warehousing
    - Data at underlying sources are unstructured, semi-structured and structured
    - Extraction components are developed to extract data from the sources
- Changes to data and schema
  - Internet
    - Users need to be notified upon changes to web pages of interest
  - Data warehousing
    - Data changes must be detected and propagated to the warehouse for warehouse update
    - Schema changes must be detected and used to update the extraction components to conform to the new schema; otherwise they fail to function
- Practice
  - Internet - manual change detection and "What's New" pages
  - Data warehousing
    - Changes to data: refreshing from scratch (expensive) and incremental update
    - Changes to schema: manual (unacceptable when changes are frequent)
4Problem Statements
- How to deal with changes to data and schema when
  - the information sources are autonomous
    - Little cooperation is provided by the sources
  - changes to schema are frequent
    - Such as 2 to 3 times per year in a scientific environment
    - Manual approach is unacceptable
5Research Objectives
- A model for semi-structured documents
  - Capable of representing both data and schema
- Efficient approach to detecting changes to data
  - Taking advantage of the fact that new documents must be parsed anyway, so detection is performed during parsing
- Semi-automatic approaches to detecting changes to schema
7Related Work: Changes to Data
- File modification date
- Checksum - for CGI-generated documents
- Longest Common Subsequence
  - Detects differences between strings
  - UNIX diff
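As a rough illustration of the checksum and diff techniques listed above, the sketch below uses Python's standard `hashlib` and `difflib` modules. The helper names `checksum` and `changed_lines` and the sample page text are assumptions made for this example. Note that `difflib` uses a matching-block heuristic that is related to, but not the same as, a textbook longest common subsequence.

```python
import difflib
import hashlib


def checksum(text: str) -> str:
    """Return a SHA-256 digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_lines(old: str, new: str) -> list:
    """Return only the removed (-) and added (+) lines of a line-level diff."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical page contents, invented for this example.
old_page = "Title\nStatus: open\nContact: A"
new_page = "Title\nStatus: closed\nContact: A"

# A checksum only tells us whether the page changed ...
page_changed = checksum(old_page) != checksum(new_page)
# ... while a diff also tells us what changed.
delta = changed_lines(old_page, new_page)
print(page_changed, delta)
```

The checksum is cheap but coarse; the diff is more informative but has to read both versions in full, which is the cost the later slides try to fold into parsing.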
8Related Work: Changes to Data (contd)
- Ball et al. of AT&T [ref]
  - Views semi-structured documents as containing
    - Sentences
    - Sentence-breaking markups - e.g. <P>, <HR>, <LI>, <H1>
    - Non-sentence-breaking markups - e.g. <IMG> and <A>
  - Change detection approach
    - Compare two versions of the same semi-structured document
    - Match corresponding sentence-breaking markups
    - Match corresponding sentences
  - Graphical display of the differences
    - Show the two versions side by side
    - Show the differences only
    - Merge the two pages
  - Limitation
    - Expensive - a sentence in the new version may be compared with all the sentences in the old version
9Related Work: Changes to Data (contd)
- Chawathe et al. of Stanford [ref]
  - Views the content of a semi-structured document as a tree
    - Node: sentence-breaking markup
    - Edge: hierarchical relationship between nodes
  - Change detection approach
    - Generate the corresponding tree for successive versions of the same doc
    - Transform the new tree to exactly match the old version
      - Operations: addition, deletion, reordering, and moving of sub-trees
    - Select the sequence with the least total cost
  - Limitation - expensive
[Figure: numbered-node trees for two versions of a document, illustrating the move, addition, and deletion operations]
10Related Work: Changes to Data (contd)
- URL Minder [ref]
  - Approach
    - Detects changes to a desired portion based on the preceding word(s) and the following word(s) of that portion, supplied by the user
  - Limitations
    - User-supplied phrases may be deleted or updated
    - The same phrase may appear somewhere else in the document
  - E.g.,
Although ..... The Consortium uses the Hackensack
River, its tributaries and associated estuarine
salt marshes as a living laboratory for middle
and high school students. The Consortium program
is interdisciplinary and investigates
the environmental, cultural and historical
impacts on the water quality of the Hackensack
River Watershed. Participating teachers attend
training workshops on the use and methodology of
the water quality kits. Consortium teachers
participate in workshops that provide cultural,
historical and environmental background materials
on the Hackensack River Watershed and the
surrounding New York / New Jersey Harbor Complex
including ecological status and trends reports,
field manuals and local histories. Students and
teachers collect data from sites located along
the Hackensack River Watershed. Although this
data includes measurements such as salinity,
dissolved oxygen, pH, nitrates, temperature,
water velocity and turbidity. The
HMDC Environmental Operations Research Laboratory
provides technical and advisory support. This is
the end of HMDC sample paragraph.
Portion of interest
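The two limitations of phrase-anchored detection can be reproduced in a few lines. This is a minimal sketch of the general idea only, not URL Minder's actual implementation; the function `portion_between` and the shortened sample text are assumptions for illustration.

```python
from typing import Optional


def portion_between(text: str, before: str, after: str) -> Optional[str]:
    """Return the text between the first `before` phrase and the next `after` phrase."""
    start = text.find(before)
    if start == -1:
        return None  # the anchor phrase was deleted or reworded
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Shortened, hypothetical page text in which "Although" occurs twice.
page = (
    "Although the program is new, teachers attend workshops. "
    "Although this data includes salinity and pH. The laboratory provides support."
)

# The user meant the second "Although", but the first occurrence is matched.
wrong_portion = portion_between(page, "Although", ".")
# A reworded anchor makes the portion impossible to locate at all.
lost_portion = portion_between(page, "Even though", ".")
print(wrong_portion, lost_portion)
```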
11Related Work - Changes to Schema
- Nestorov [ref] and Abiteboul [ref]
  - Use an edge-labeled graph to model schema for semi-structured documents
  - Store semantic information by labeling the edges
- Limitation
  - Labeling the edges is performed manually
  - Not all characteristics of schema for semi-structured documents are represented
    - E.g., no notion of ordering, of mandatory/optional nodes, etc.
13Overview of our approach
- For detecting data and schema changes
[Figure: processing flow. Inputs: Schema, Old version, New version, New doc. Steps: Generate Schema Graph, Generate Document Graph, Parse using Schema Graph, Parse using Value-added Graph, Merge. Artifacts: schema graph, document graph, value-added graph]
14Semi-structured Documents
- Characteristics
  - Irregular structure - different representations, missing structuring
  - Implicit structure - different interpretations, parsing required
  - Indicative structure: no strict typing policy
  - Human data entry errors may be introduced
- A semi-structured document consists of
  - Data objects
    - Mandatory objects - must exist in the documents
    - Optional objects - may or may not exist in the documents
  - Relationship
    - Mandatory relationships - if one data object exists, the other must also exist
    - Optional relationships - if one data object exists, the other does not need to exist
  - Ordering
    - Significant ordering - proper ordering among a set of data objects must be preserved
    - Insignificant ordering - no strict ordering is preserved for a set of data objects
  - Repeatability
    - Non-repeatable object - 0 or 1 occurrence of an object may exist
    - Repeatable object - 2 or more occurrences of an object may exist
15Model
- We use a directed graph to model data and schema
  - Consisting of nodes and directed edges
- Representing schema for semi-structured documents
  - Data objects
    - Mandatory objects - regular node
    - Optional objects - optional node
  - Relationship
    - Mandatory relationships - a single directed edge exists between two nodes
    - Optional relationships - at least one optional node exists between the two nodes
  - Ordering
    - Significant ordering - solid edges are used to connect a parent and its children
    - Insignificant ordering - dashed edges are used to connect a parent and its children
  - Repeatability
    - Non-repeatable object - no looping directed edge (repeatable edge) is used
    - Repeatable object - a repeatable edge is used
    - (A stopping node is used as a no-op for an optional node and as a terminator for a repeatable node)
[Figure: example schema graph; node labels a, b, c, d, e, s]
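One way to hold these schema properties in code is sketched below. It is a simplification and an assumption of this example: it stores the properties as flags on a tree of nodes rather than as explicit solid, dashed, and looping edges with stopping nodes, and the class `SchemaNode`, the helper `describe`, and the sample schema are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaNode:
    name: str
    optional: bool = False      # optional node vs. regular (mandatory) node
    repeatable: bool = False    # True when a repeatable (looping) edge is attached
    ordered: bool = True        # True: solid edges to children; False: dashed edges
    children: List["SchemaNode"] = field(default_factory=list)


def describe(node: SchemaNode, depth: int = 0) -> List[str]:
    """Flatten the graph into indented text lines for inspection."""
    flags = ["optional" if node.optional else "mandatory"]
    if node.repeatable:
        flags.append("repeatable")
    if node.children:
        flags.append("ordered children" if node.ordered else "unordered children")
    lines = ["  " * depth + f"{node.name} [{', '.join(flags)}]"]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


# Hypothetical schema, invented for this example.
schema = SchemaNode("record", children=[
    SchemaNode("title"),
    SchemaNode("author", optional=True, repeatable=True),
    SchemaNode("remarks", optional=True, ordered=False, children=[
        SchemaNode("keyword", repeatable=True),
        SchemaNode("comment", optional=True),
    ]),
])

print("\n".join(describe(schema)))
```

Keeping the ordering flag on the parent mirrors the slide's rule that solid or dashed edges connect a parent to all of its children.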
16Model: Schema Graph (contd)
- E.g., simplified schema graph for PDB documents
17Model: Document Graph
- View the content of a document as an instance of the schema
- E.g.,
[Figure: content of a scientific document alongside its document graph representation]
18Model: Value-Added Schema Graph
- Instantiated based on an older version of the document
- To construct
  - Parse the older version using the schema graph
  - Copy the values of objects to the corresponding nodes on the schema graph
  - Extend optional nodes as necessary to handle loops
- E.g.,
[Figure: value-added schema graph]
20Detecting Changes to Data
- Algorithm
  - Find the corresponding matching nodes in the document and the value-added schema graph while parsing
  - If the matching nodes are
    - ordered nodes,
      - compare their values -> value update
    - unordered nodes,
      - compare their values -> value update
      - compare their positions with respect to their siblings -> reorder
    - non-repeatable optional nodes,
      - compare their identifiers -> deletion, insertion
      - for equal identifiers, compare their values -> value update
    - repeatable optional nodes, create two lists starting from the matching sub-root
      - sequentially compare the two lists
      - for matching nodes, compare their values -> value update
      - for matching nodes, identify their locations in the list -> move
      - for non-matching nodes in the schema graph/doc -> deletion/insertion
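The list comparison for repeatable optional nodes might look roughly like the sketch below. It is not the presented implementation: it assumes each node is an (identifier, value) pair with unique identifiers, and it uses a deliberately simple move rule that flags every matched node whose rank among the matched nodes differs. The function name `detect_changes` and the sample data are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[str, str]  # (identifier, value)


def detect_changes(old: List[Node], new: List[Node]) -> List[Tuple[str, str]]:
    """Compare two sibling lists and report (change type, identifier) pairs."""
    old_values = dict(old)
    new_values = dict(new)
    changes = []

    # Nodes present in the new document.
    for ident, value in new:
        if ident not in old_values:
            changes.append(("insertion", ident))
        elif old_values[ident] != value:
            changes.append(("value update", ident))

    # Nodes that exist only in the old version.
    for ident, _ in old:
        if ident not in new_values:
            changes.append(("deletion", ident))

    # Matched nodes whose rank among the matched nodes differs count as moved.
    matched_old = [i for i, _ in old if i in new_values]
    matched_new = [i for i, _ in new if i in old_values]
    for before, after in zip(matched_old, matched_new):
        if before != after:
            changes.append(("move", after))
    return changes


# Old values as stored on the value-added graph vs. the newly parsed document.
old_nodes = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")]
new_nodes = [("b", "2"), ("a", "1"), ("c", "9"), ("e", "5")]
print(detect_changes(old_nodes, new_nodes))
```

A production version would report a single move for a shifted node rather than flagging both swapped neighbours; the simple rule is kept here for brevity.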
22Detecting Changes to Schema
- Types of schema changes
  - Reordering of a set of nodes in the schema graph
  - Insertion of a new node
  - Deletion of an existing node
  - Renaming of existing nodes
  - Addition of a new repeatable edge
  - Deletion of an existing repeatable edge
  - Conversion of significant into insignificant ordering
  - Conversion of insignificant into significant ordering
- Two cases - to simplify the problem
  - Non-existence of human errors during data entry
  - Existence of human errors during data entry - misinterpretation of schema changes
- Approaches
  - Periodic - transforming the current schema into the new schema after evaluating a set of documents
  - Immediate - progressively transforming the current schema after evaluating each new document
23Periodic Vs Immediate
24Periodic Approach
- Algorithm
  - While parsing, for each node in the document
    - The matching node is discovered in the value-added schema graph
    - If the node is
      - a regular node, its children are compared
        - Partially matching nodes -> rename
        - Unmatched nodes on the value-added graph -> deletion
        - Unmatched nodes in the document -> insertion
      - an optional node, the children of the node in the schema are compared with the corresponding sub-tree in the document
        - Nodes never discovered before -> new/renamed nodes
        - Nodes appearing more than once -> repeatable nodes
        - Nodes appearing before the last node of the sub-tree -> repeatable nodes
        - Nodes appearing only at the end of the sub-tree -> non-repeatable nodes
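The child comparison can be sketched as follows. The slides do not define "partially matching", so this example assumes a string-similarity threshold computed with the standard `difflib.SequenceMatcher`; that choice, the function `compare_children`, the threshold value, and the node names are all assumptions for illustration. The sketch covers only the rename, deletion, insertion, and "appears more than once" rules.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def compare_children(schema_children, doc_children, rename_threshold=0.8):
    """Report rename/deletion/insertion/repeatable findings for one parent node."""
    counts = Counter(doc_children)  # keeps first-seen order of names
    unmatched_schema = [n for n in schema_children if n not in counts]
    unmatched_doc = [n for n in counts if n not in schema_children]
    findings = []

    # Pair each unmatched schema node with its most similar unmatched document node.
    for old in list(unmatched_schema):
        best = max(unmatched_doc, key=lambda n: similarity(old, n), default=None)
        if best is not None and similarity(old, best) >= rename_threshold:
            findings.append(("rename", old, best))
            unmatched_schema.remove(old)
            unmatched_doc.remove(best)

    findings += [("deletion", n) for n in unmatched_schema]
    findings += [("insertion", n) for n in unmatched_doc]
    findings += [("repeatable", n) for n, c in counts.items() if c > 1]
    return findings


# Hypothetical node names, invented for this example.
schema_children = ["title", "author", "remark"]
doc_children = ["title", "authors", "authors", "journal"]
print(compare_children(schema_children, doc_children))
```

Under the periodic approach, findings like these would be accumulated over a set of documents before the schema graph is transformed.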
25Immediate Approach
- Detect schema changes based on discovered events
  - Events: discovery of a new node, renaming of an existing node, different ordering, missing mandatory node, appearance of mandatory node, missing optional node, appearance of optional node, single appearance of repeatable node, multiple appearances of repeatable node, multiple appearances of non-repeatable node
- Conclusion
  - Terminal conclusion - certain conclusion, relevant conclusion
  - Incomplete conclusion - insufficient events to derive a final conclusion
    - More new documents are needed to convert an incomplete conclusion into a terminal conclusion
- E.g., event: discovery of a new node
26Complete event-decision diagram
27Incorporating Data Entry Errors
- Types of errors
  - Node
    - A mandatory node is mistakenly excluded from a new document
    - A mandatory node is mistakenly renamed with a different identifier
    - A removed mandatory node mistakenly reappears
    - An optional node that is supposed to appear is mistakenly excluded
    - An optional node that is supposed to be excluded appears
  - Ordering
    - A significant ordering is mistakenly reordered
    - An old, removed ordering reappears
  - Repeatable edge
    - A non-repeatable node mistakenly appears more than once
28Incorporating Data Entry Errors (contd)
- Periodic approach
  - After a significantly large collection of documents has been evaluated for schema changes, a straightforward statistical measurement can be used to incorporate errors
  - E.g., a mandatory node
    - Stays mandatory if it appears in 90% or more of the new documents
      - 10% is used to incorporate errors
    - Has been converted into optional if it appears in 30-70% of the new docs
      - 30% is used to incorporate errors
    - Has been removed from the new schema if it appears in <3% of new docs
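The threshold rule can be written down directly. This sketch reads the slide's figures as percentages (90, 30-70, and under 3), which is an interpretation of the extracted text; the function name and the "undecided" label for shares that fall between the stated bands are assumptions of this example.

```python
def classify_mandatory_node(appearances: int, total_docs: int) -> str:
    """Classify a formerly mandatory node from its share of appearances."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    share = appearances / total_docs
    if share >= 0.90:
        return "mandatory"   # up to 10% absences are treated as entry errors
    if 0.30 <= share <= 0.70:
        return "optional"
    if share < 0.03:
        return "removed"     # rare appearances are treated as entry errors
    return "undecided"       # assumption: the slide does not cover these bands


for seen in (95, 50, 1, 80):
    print(seen, classify_mandatory_node(seen, 100))
```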
29Incorporating Data Entry Errors (contd)
- Immediate approach
  - Adopt the Bayesian Sequential Analysis Procedure to incorporate potential errors
  - Decision criteria: posterior probability or utility value
  - Parameters
    - Different possible actions (terminal decisions that can be taken)
    - Different states of the world
    - Prior probability value
    - Likelihood probability
    - Utility value
    - Cost of evaluating a new document
30Posterior computation for evaluating two docs
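A posterior computation over two documents can be sketched as below. It uses the posterior probability as the only decision criterion and leaves out utility values and the cost of evaluating a document. The two states, the prior, the likelihoods, and the 0.9 stopping threshold are all assumed numbers chosen for illustration, not values from the presentation.

```python
from typing import Dict

Belief = Dict[str, float]


def update(prior: Belief, likelihood: Belief) -> Belief:
    """One Bayes step: the posterior is proportional to prior times likelihood."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: p / total for state, p in unnormalized.items()}


def decide(prior, events, likelihoods, threshold=0.9):
    """Update after each document; stop once one state reaches the threshold."""
    belief = dict(prior)
    for n, event in enumerate(events, start=1):
        belief = update(belief, likelihoods[event])
        state, p = max(belief.items(), key=lambda kv: kv[1])
        if p >= threshold:
            return state, n, belief      # terminal conclusion
    return None, len(events), belief     # conclusion is still incomplete


# Assumed numbers: is a formerly mandatory node still mandatory, or now optional?
prior = {"mandatory": 0.9, "optional": 0.1}
likelihoods = {
    "missing": {"mandatory": 0.05, "optional": 0.5},  # 0.05 = assumed entry-error rate
    "present": {"mandatory": 0.95, "optional": 0.5},
}

conclusion, docs_used, belief = decide(prior, ["missing", "missing"], likelihoods)
print(conclusion, docs_used, round(belief["optional"], 3))
```

With these numbers a single missing node is not enough to reach the threshold, so the conclusion stays incomplete until the second document is evaluated.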
32Implementation
- DataFoundry at LLNL
  - Objective: to provide LLNL scientists with a uniform and semantically consistent repository for a variety of biological data coming from several information sources
- Environment
  - Distributed, autonomous, heterogeneous information sources
  - There are semantic and syntactic conflicts among information sources
  - The collection of biological documents is large
  - Changes to schema are frequent
    - To include 50 sources 1 schema change per year on the average!
- Implementation detail
  - Implemented a web crawler/ingest engine using Perl, Unix scripts, Internet tools
  - Periodically visits genome sites and detects changes based on modification dates
  - Implemented data change detection - Lex/Yacc, C/C++
  - In progress: implementing schema change detection
33Implementation (contd)
- Change management system for LLNL
- Architecture
35Research Contributions
- Developed a model for semi-structured documents
- Used the model to detect
  - Changes to data, efficiently, while parsing the new document
    - Performed during parsing of the new documents, i.e., the cost of change detection is partly absorbed by the cost of parsing
  - Changes to schema
    - Identified the types of schema changes
    - Proposed periodic and immediate approaches to detecting schema changes
    - Incorporated potential human errors during data entry
36Future Direction
- Performance study
- Accuracy of constructing a new schema that is significantly different from the current schema
- For some schema changes
  - The current schema may remain functional in terms of parsing
  - Such changes are undetectable, and a different approach may be required
  - For now, a domain expert is used to detect them
  - Example
37Future Direction (contd)
- Determining whether a change is to data, schema or both
  - Currently we determine that the schema has changed by
    - monitoring the file modification date of the free-format schema file made publicly available by the sources
    - waiting for the current parser to fail
      - some schema changes do not cause the current parser to fail
- Utilizing other statistical tools to detect changes to schema in the presence of errors
  - Making use of statistical distributions
  - Capable of detecting schema changes that cannot be automatically detected using the Bayesian Sequential Procedure