Detecting Changes to Data and Schema in Semistructured Documents

1
Detecting Changes to Data and Schema in
Semi-structured Documents
  • Igg Adiwijaya
  • CIMIC
  • Rutgers University
  • Committee members
  • Dr. Nabil Adam, Rutgers
  • Dr. Vijay Atluri, Rutgers
  • Dr. James Geller, NJIT
  • Dr. Ron Musick, LLNL
  • Dr. Yelena Yesha, UMBC

2
Outline
  • Motivation
  • Related work
  • A model for semi-structured documents
  • Approach for detecting changes to data
  • Approaches for detecting changes to schema
  • Periodic, immediate and incorporating errors
  • Implementation - LLNL DataFoundry
  • Research contributions and future directions

3
Motivation
  • Increasing need to detect changes to
    semi-structured documents
  • Internet - data are stored in semi-structured
    documents
  • Data warehousing
  • Data at underlying sources are unstructured,
    semi-structured and structured
  • Extraction components are developed to extract
    data from the sources
  • Changes to data and schema
  • Internet
  • Users need to be notified upon changes to web
    pages of interest
  • Data warehousing
  • Data changes must be detected and propagated to
    the warehouse for warehouse update
  • Schema changes must be detected and used to
    update the extraction components to conform to
    the new schema. Otherwise they fail to function
  • Practice
  • Internet - manual change detection, or relying
    on a "What is new" page
  • Data warehousing
  • Changes to data - refreshing from scratch
    (expensive) or incremental update
  • Changes to schema - handled manually
    (unacceptable when changes are frequent)

4
Problem Statements
  • How to deal with changes to data and schema when
  • The information sources are autonomous
  • Little cooperation is provided by the sources
  • Changes to schema are frequent
  • Such as 2 to 3 times per year in a scientific
    environment
  • A manual approach is unacceptable

5
Research Objectives
  • A model for semi-structured documents
  • Capable of representing both data and schema
  • Efficient approach to detecting changes to data
  • Taking advantage of the fact that parsing new
    documents must be performed and the detection is
    performed during parsing
  • Semi-automatic approaches to detecting changes to
    schema

7
Related Work: Changes to Data
  • File-modification date
  • Checksum - for CGI-generated documents
  • Longest Common Subsequence
  • Detect differences between strings
  • UNIX Diff
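
As a minimal sketch of the longest-common-subsequence idea behind tools such as UNIX diff, the function below computes one LCS of two sequences (for example, the lines of two document versions) with the standard dynamic-programming table. The function name `lcs` and the sample inputs are illustrative assumptions, not taken from the presentation.

```python
def lcs(old, new):
    """Return one longest common subsequence of two sequences as a list."""
    m, n = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    result, i, j = [], 0, 0
    while i < m and j < n:
        if old[i] == new[j]:
            result.append(old[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


common = lcs(["a", "b", "c", "d"], ["a", "c", "d", "e"])
```

Elements of the old version that are missing from the LCS can be read as deletions, and elements of the new version that are missing from it as insertions.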

8
Related Work: Changes to Data (contd)
  • Ball et al. of AT&T [ref]
  • Views semi-structured documents as containing
  • Sentences
  • Sentence-breaking markups - e.g. <P>, <HR>, <LI>,
    <H1>
  • Non-sentence-breaking markups - e.g. <IMG> and
    <A>
  • Change detection approach
  • Compare two versions of the same semi-structured
    documents
  • Match corresponding sentence-breaking markups
  • Match corresponding sentences
  • Graphical display for the differences
  • Show the two versions side by side
  • Show the difference only
  • Merge the two pages
  • Limitation
  • Expensive - each sentence in the new version may
    be compared with all sentences in the old
    version

9
Related Work: Changes to Data (contd)
  • Chawathe et al. of Stanford [ref]
  • Views the content of a semi-structured document
    as a tree
  • Node - sentence-breaking markup
  • Edge - hierarchical relationship between nodes
  • Change detection approach
  • Generate the corresponding tree for successive
    versions of the same doc
  • Transform the new tree to exactly match the old
    version
  • Operations - addition, deletion, reordering, and
    moving of sub-trees
  • Select the operation sequence with the least
    total cost
  • Limitation - expensive

[Figure: example trees with numbered nodes (1-11) illustrating the move, addition, and deletion operations]

10
Related Work: Changes to Data (contd)
  • URL Minder [ref]
  • Approach
  • Detects desired changes based on the preceding
    word(s) and the following word(s) (supplied by
    the users) of the desired portion
  • Limitations
  • User supplied phrases may be deleted or updated
  • The same phrase may appear somewhere else in the
    document
  • E.g.,

Although ..... The Consortium uses the Hackensack
River, its tributaries and associated estuarine
salt marshes as a living laboratory for middle
and high school students. The Consortium program
is interdisciplinary and investigates
the environmental, cultural and historical
impacts on the water quality of the Hackensack
River Watershed. Participating teachers attend
training workshops on the use and methodology of
the water quality kits. Consortium teachers
participate in workshops that provide cultural,
historical and environmental background materials
on the Hackensack River Watershed and the
surrounding New York / New Jersey Harbor Complex
including ecological status and trends reports,
field manuals and local histories. Students and
teachers collect data from sites located along
the Hackensack River Watershed. Although this
data includes measurements such as salinity,
dissolved oxygen, pH, nitrates, temperature,
water velocity and turbidity. The
HMDC Environmental Operations Research Laboratory
provides technical and advisory support. This is
the end of HMDC sample paragraph.
Portion of interest
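
The second limitation can be illustrated with a small sketch. It assumes, hypothetically, that the portion of interest is located by searching for a user-supplied preceding phrase and the next following phrase; the function name `portions_between` and the shortened sample text are illustrative only and do not describe how URL Minder is actually implemented.

```python
def portions_between(text, before, after):
    """Return every span lying between `before` and the next `after`."""
    spans = []
    start = text.find(before)
    while start != -1:
        begin = start + len(before)
        end = text.find(after, begin)
        if end == -1:
            break  # no closing phrase: stop searching
        spans.append(text[begin:end].strip())
        start = text.find(before, begin)
    return spans


sample = (
    "Although the river is monitored. The laboratory helps. "
    "Although this data includes salinity. The laboratory helps."
)
matches = portions_between(sample, "Although", ".")
```

Because the preceding word "Although" occurs twice, `matches` holds two candidate portions, and the phrases alone cannot tell which one the user meant.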
11
Related Work - Changes to Schema
  • Nestorov [ref] and Abiteboul [ref]
  • Use an edge-labeled graph to model the schema of
    semi-structured documents
  • Store semantic information by labeling the edges
  • Limitation
  • Labeling the edges is performed manually
  • Not all characteristics of the schema of
    semi-structured documents are represented
  • E.g., no notion of ordering, of
    mandatory/optional nodes, etc.

13
Overview of our approach
  • For detecting data and schema changes

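[Figure: flow diagram. Data changes: inputs New version, Schema, and Old version; steps Generate Document Graph, Generate Schema Graph, and Parse using Value-added Graph; outputs document graph, schema graph, and value-added graph. Schema changes: inputs New doc and Schema; steps Generate Schema Graph, Parse using Schema Graph, and Merge; output schema graph.]
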
14
Semi-structured Documents
  • Characteristics
  • Irregular structure - different representations,
    missing structure
  • Implicit structure - different interpretations,
    parsing required
  • Indicative structure - no strict typing policy
  • Human data entry errors may be introduced
  • A semi-structured document consists of
  • Data objects
  • Mandatory objects - must exist in the documents
  • Optional objects - may or may not exist in the
    documents
  • Relationship
  • Mandatory relationships - if one data object
    exists, the other must also exist
  • Optional relationships - If one data object
    exists, the other does not need to exist
  • Ordering
  • Significant ordering - proper ordering among a
    set of data objects must be preserved
  • Insignificant ordering - no strict ordering is
    preserved for a set of data objects
  • Repeatability
  • Non-repeatable object - 0 or 1 occurrence of an
    object may exist
  • Repeatable object - 2 or more occurrences of an
    object may exist

15
Model
  • We use a directed graph to model data and schema
  • Consisting of nodes and directed edges
  • Representing schema for semi-structured documents
  • Data objects
  • Mandatory objects - regular node
  • Optional objects - optional node
  • Relationship
  • Mandatory relationships - a single directed edge
    exists between two nodes
  • Optional relationships - at least one optional
    node exists between the two nodes
  • Ordering
  • Significant ordering - solid edges are used to
    connect parent and its children
  • Insignificant ordering - dashed edges are used to
    connect parent and its children
  • Repeatability
  • Non-repeatable object - no looping directed edge
    (repeatable edge) is used
  • Repeatable object - a repeatable edge is used
  • (A stopping node is used as a no-op for an
    optional node and as a terminator for a
    repeatable node)

[Figure: example schema graph with nodes labeled a, b, c, d, e, and s]

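
One possible in-memory form of the schema model is sketched below. It is a simplification under stated assumptions: the graph is reduced to a tree, and optional nodes, ordering, and repeatable edges become boolean flags instead of separate node and edge types. The class name `SchemaNode`, its fields, and the PDB-like record names are hypothetical illustrations, not the presenter's data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaNode:
    name: str
    optional: bool = False    # optional object; False means a regular (mandatory) node
    repeatable: bool = False  # True when a repeatable (looping) edge is present
    ordered: bool = True      # solid edges: the order of the children is significant
    children: List["SchemaNode"] = field(default_factory=list)


def mandatory_children(node: SchemaNode) -> List[str]:
    """Names of the children that must exist in every document."""
    return [child.name for child in node.children if not child.optional]


# Hypothetical, heavily simplified schema for a PDB-like document.
entry = SchemaNode("ENTRY", children=[
    SchemaNode("HEADER"),
    SchemaNode("REMARK", optional=True, repeatable=True),
    SchemaNode("ATOM", repeatable=True),
])
```

Here `mandatory_children(entry)` lists HEADER and ATOM, while REMARK may be absent or appear several times.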
16
Model: Schema Graph (contd)
  • E.g., simplified schema graph for PDB documents

17
Model: Document Graph
  • View the content of a document as an instance of
    the schema
  • E.g.,

[Figure: content of a scientific document and its document graph representation]
18
Model: Value-Added Schema Graph
  • Instantiated based on an older version of the
    document
  • To construct
  • Parse the older version using the schema graph
  • Copy the values of objects to the corresponding
    nodes on the schema graph
  • Extend optional nodes as necessary to handle loops
  • E.g.,

[Figure: value-added schema graph]
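
The construction steps listed on this slide can be sketched roughly as follows. The sketch assumes a flat schema given as (name, repeatable) pairs and an already parsed older document given as (name, value) pairs; both representations, the function name `build_value_added`, and the sample values are hypothetical simplifications of the graph model.

```python
def build_value_added(schema, old_document):
    """Attach the values found in the older document to the schema entries.

    schema:       list of (name, repeatable) pairs
    old_document: list of (name, value) pairs in document order
    """
    value_added = {}
    for name, repeatable in schema:
        values = [value for node, value in old_document if node == name]
        if repeatable:
            # One slot per occurrence, i.e. the loop is unrolled.
            value_added[name] = values
        else:
            # A missing node is recorded as None.
            value_added[name] = values[0] if values else None
    return value_added


schema = [("HEADER", False), ("REMARK", True), ("JRNL", False)]
old_doc = [("HEADER", "LYSOZYME"), ("REMARK", "r1"), ("REMARK", "r2")]
graph = build_value_added(schema, old_doc)
```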

20
Detecting Changes to Data
  • Algorithm
  • Find the corresponding matching nodes in the
    document and the value-added schema graph while
    parsing
  • If the matching nodes are
  • ordered nodes,
  • compare their values → value update
  • unordered nodes,
  • compare their values → value update
  • compare their positions with respect to their
    siblings → reorder
  • non-repeatable optional nodes,
  • compare their identifiers → deletion/insertion
  • for equal identifiers, compare their values →
    value update
  • repeatable optional nodes, create two lists
    starting from the matching sub-root
  • sequentially compare the two lists
  • for matching nodes, compare their values → value
    update
  • for matching nodes, identify their locations in
    the lists → move
  • for non-matching nodes in the schema graph/doc →
    deletion/insertion
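
The repeatable-optional-node case can be sketched as a comparison of two lists. This is a minimal illustration under stated assumptions: each list holds (identifier, value) pairs, identifiers are unique within a list, and a move is reported when a node's rank among the shared identifiers differs between the two lists. The function name `diff_repeatable` and the sample data are hypothetical.

```python
def diff_repeatable(old, new):
    """Compare two lists of (identifier, value) pairs taken from the same
    repeatable sub-tree. Identifiers are assumed to be unique per list."""
    old_values = dict(old)
    new_ids = {ident for ident, _ in new}
    # Rank each shared identifier among the shared identifiers only, so
    # that an insertion or deletion alone is not reported as a move.
    old_rank = {ident: r for r, ident in
                enumerate(i for i, _ in old if i in new_ids)}
    new_rank = {ident: r for r, ident in
                enumerate(i for i, _ in new if i in old_values)}
    changes = []
    for ident, value in new:
        if ident not in old_values:
            changes.append(("insertion", ident))
            continue
        if old_values[ident] != value:
            changes.append(("value update", ident))
        if old_rank[ident] != new_rank[ident]:
            changes.append(("move", ident))
    for ident, _ in old:
        if ident not in new_ids:
            changes.append(("deletion", ident))
    return changes


changes = diff_repeatable(
    [("a", 1), ("b", 2), ("c", 3)],   # values from the value-added graph
    [("a", 1), ("c", 30), ("d", 4)],  # values parsed from the new document
)
```

In this sketch a single relocated node also shifts the rank of the nodes it passes, so several moves may be reported for one relocation.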

22
Detecting Changes to Schema
  • Types of schema changes
  • Reordering of a set of nodes in the schema graph
  • Insertion of a new node
  • Deletion of an existing node
  • Renaming existing nodes
  • Addition of a new repeatable edge
  • Deletion of an existing repeatable edge
  • Conversion of significant into insignificant
    ordering
  • Conversion of insignificant into significant
    ordering
  • Two cases - to simplify the problem
  • Non-existence of human errors during data entry
  • Existence of human errors during data entry -
    misinterpretation of schema changes
  • Approaches
  • Periodic - transforming current schema into new
    schema after evaluating a set of documents
  • Immediate - progressively transforming current
    schema after evaluating each new document

23
Periodic Vs Immediate
24
Periodic Approach
  • Algorithm
  • While parsing, for each node in the document
  • The matching node is discovered in the
    value-added schema graph
  • If the node is
  • a regular node, its children are compared
  • Partially matching nodes → rename
  • Unmatched nodes on the value-added graph →
    deletion
  • Unmatched nodes in the document → insertion
  • an optional node, the children of the node in the
    schema are compared with the corresponding
    sub-tree in the document
  • Nodes never discovered before → new/renamed
    nodes
  • Nodes appearing more than once → repeatable
    nodes
  • Nodes appearing before the last node of the
    sub-tree → repeatable nodes
  • Nodes appearing only at the end of the sub-tree →
    non-repeatable nodes
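
The regular-node case can be sketched as a comparison of child names. The slides do not say how a "partial match" is measured; the sketch below assumes, purely for illustration, a string-similarity ratio from Python's `difflib` with a threshold of 0.8. The function name `classify_children`, the threshold, and the sample record names are hypothetical.

```python
from difflib import SequenceMatcher


def classify_children(schema_children, doc_children, threshold=0.8):
    """Classify child names as rename, deletion, or insertion candidates."""
    unmatched_schema = [n for n in schema_children if n not in doc_children]
    unmatched_doc = [n for n in doc_children if n not in schema_children]
    renames = []
    for old in list(unmatched_schema):
        # Assumed partial-match test: the most similar unmatched name.
        best = max(
            unmatched_doc,
            key=lambda new: SequenceMatcher(None, old, new).ratio(),
            default=None,
        )
        if best is not None and SequenceMatcher(None, old, best).ratio() >= threshold:
            renames.append((old, best))
            unmatched_schema.remove(old)
            unmatched_doc.remove(best)
    return {
        "rename": renames,
        "deletion": unmatched_schema,  # only on the value-added graph
        "insertion": unmatched_doc,    # only in the document
    }


result = classify_children(
    ["HEADER", "AUTHOR", "REVDAT"],
    ["HEADER", "AUTHORS", "JRNL"],
)
```

With these inputs, AUTHOR and AUTHORS are paired as a rename candidate, REVDAT is a deletion candidate, and JRNL is an insertion candidate.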

25
Immediate Approach
  • Detect schema changes based on discovered events
  • Events - discovery of a new node, renaming of an
    existing node, different ordering, missing
    mandatory node, appearance of mandatory node,
    missing optional node, appearance of optional
    node, single appearance of repeatable node,
    multiple appearances of repeatable node, multiple
    appearances of non-repeatable node
  • Conclusion
  • Terminal conclusion - certain conclusion,
    relevant conclusion
  • Incomplete conclusion - insufficient events to
    derive a final conclusion
  • More new documents are needed to convert an
    incomplete conclusion into a terminal one
  • E.g., event - discovery of a new node

26
Complete event-decision diagram
27
Incorporating Data Entry Errors
  • Types of Errors
  • Node
  • A mandatory node is mistakenly excluded from a
    new document
  • A mandatory node is mistakenly renamed with a
    different identifier
  • A removed mandatory node mistakenly reappears
  • An optional node, which is supposed to appear, is
    mistakenly excluded
  • An optional node, which is supposed to be
    excluded, appears
  • Ordering
  • A significant ordering is mistakenly reordered
  • A previously removed ordering reappears
  • Repeatable edge
  • A non-repeatable node mistakenly appears more
    than once

28
Incorporating Data Entry Errors (contd)
  • Periodic approach
  • After a sufficiently large collection of
    documents has been evaluated for schema changes,
    a straightforward statistical measurement can be
    used to incorporate errors
  • E.g., a mandatory node
  • Stays mandatory if it appears in 90% or more of
    the new documents
  • 10% is allowed for errors
  • Has been converted into optional if it appears
    in 30-70% of the new docs
  • 30% is allowed for errors
  • Has been removed from the new schema if it
    appears in <3% of new docs
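
The example thresholds for a mandatory node can be written as a small decision function. The cut-offs are the ones given on this slide; the function name `classify_mandatory_node` and the "undetermined" result for shares that fall between the stated bands are assumptions added for illustration.

```python
def classify_mandatory_node(appearances, total_docs):
    """Classify a formerly mandatory node from how often it still appears."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    share = appearances / total_docs
    if share >= 0.90:
        return "mandatory"     # up to 10% absences are treated as errors
    if 0.30 <= share <= 0.70:
        return "optional"
    if share < 0.03:
        return "removed"       # rare appearances are treated as errors
    return "undetermined"      # assumption: the slide leaves these bands open


status = classify_mandatory_node(95, 100)
```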

29
Incorporating Data Entry Errors (contd)
  • Immediate approach
  • Adopt the Bayesian sequential analysis procedure
    to incorporate potential errors
  • Decision criteria - posterior probability or
    utility value
  • Parameters
  • Different possible actions (terminal decisions
    that can be taken)
  • Different states of the world
  • Prior probability value
  • Likelihood probability
  • Utility value
  • Cost of evaluating a new document
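
A minimal sketch of the posterior computation is given below. It covers only the Bayes update and a stopping rule based on posterior probability; utility values and the cost of evaluating a new document are left out. The state names, the prior and likelihood numbers, the 0.95 threshold, and the assumption that every new document shows the same event are all hypothetical.

```python
def update_posterior(prior, likelihood):
    """One Bayes step; both arguments are dicts keyed by state of the world."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: p / total for state, p in unnormalized.items()}


def documents_needed(prior, likelihood, threshold, max_docs=100):
    """Evaluate documents one at a time until one state is believed strongly
    enough; returns (number of documents, final belief) or (None, belief)."""
    belief = dict(prior)
    for count in range(1, max_docs + 1):
        belief = update_posterior(belief, likelihood)
        if max(belief.values()) >= threshold:
            return count, belief
    return None, belief


# Hypothetical numbers for the event "mandatory node missing".
prior = {"schema_changed": 0.5, "entry_error": 0.5}
likelihood = {"schema_changed": 0.9, "entry_error": 0.1}
count, belief = documents_needed(prior, likelihood, threshold=0.95)
```

With these numbers the posterior for "schema_changed" is 0.9 after one document and 0.81 / 0.82 after two, so the stopping rule is met at the second document.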

30
Posterior computation for evaluating two docs

32
Implementation
  • DataFoundry at LLNL
  • Objective - to provide LLNL scientists with a
    uniform and semantically consistent repository
    for a variety of biological data coming from
    several information sources
  • Environment
  • Distributed, autonomous, heterogeneous
    information sources
  • There are semantic and syntactic conflicts among
    information sources
  • The collection of biological documents is large
  • Changes to schema are frequent
  • To include 50 sources 1 schema change per year
    on the average!
  • Implementation detail
  • Implemented a web crawler/ingest engine using
    Perl, Unix scripts, and Internet tools
  • Periodically visits genome sites and detects
    changes based on modification dates
  • Implemented data change detection - Lex/Yacc,
    C/C++
  • In progress - implementing schema change
    detection

33
Implementation (contd)
  • Change management system for LLNL
  • Architecture

35
Research Contributions
  • Developed a model for semi-structured documents
  • Used the model to detect
  • Changes to data, efficiently, while parsing the
    new document
  • Performed while parsing the new documents, i.e.,
    the cost of change detection is partly absorbed
    by the cost of parsing
  • Changes to schema
  • Identified the types of schema changes
  • Proposed periodic and immediate approaches to
    detecting schema changes
  • Incorporated potential human errors during data
    entry

36
Future Direction
  • Performance study
  • Accuracy of constructing a new schema that is
    significantly different from the current schema
  • For some schema changes
  • The current schema may still be functional in
    terms of parsing
  • Such changes are undetectable, and a different
    approach may be required
  • For now, a domain expert is used to detect them
  • Example

37
Future Direction (contd)
  • Determining whether a change is to data, schema,
    or both
  • Currently we determine that the schema has been
    changed by
  • monitoring the file modification date of the
    free-formatted schema file made publicly
    available by the sources
  • waiting for the current parser to fail to
    function
  • some schema changes do not cause the current
    parser to fail
  • Utilizing other statistical tools to detect
    changes to schema in the presence of errors
  • Making use of statistical distribution
  • Capable of detecting schema changes that cannot
    be automatically detected using the Bayesian
    sequential procedure