DeepDetect: An Extensible System for Detecting Attribute Outliers

1
DeepDetect: An Extensible System for Detecting
Attribute Outliers & Duplicates in XML
  • Q. Peter Lau, Wynne Hsu, Judice Koh, Mong Li Lee
  • School of Computing
  • National University of Singapore

2
eXtensible Markup Language (XML)
  • Hierarchical organization
  • Semi-structured data
  • Related Elements are grouped together
  • Elements may not be similar
  • Elements may not be present
  • Elements may not have an ordering

3
XML
[Figure: XML tree with two account subtrees, each holding Branch_Code (4), Type (Checking or Savings), and a Transactions element containing several Transaction elements]
  • Hierarchical Organization
  • Relevant Transactions grouped under Accounts

4
XML
[Figure: XML account tree; each account has Branch_Code, Type, and a Transactions element containing Transaction elements]
  • Different but related elements grouped together

5
XML
[Figure: XML account tree; each account has Branch_Code, Type, and a Transactions element containing Transaction elements]
  • <Bank> element not present

6
XML
  • Increasingly being used as a medium of data
    exchange on the WWW
  • Data is increasingly being stored in XML format
  • Native XML database systems
  • Example
  • UniProt database of worldwide protein sequences
    (www.uniprot.org)

7
Artifacts in XML
  • Dirty data
  • Redundancies
  • Discrepancies
  • Errors
  • May manifest as
  • Duplicates
  • Attribute Outliers

8
Duplicates
  • Real world entities that are copies of one
    another
  • But sometimes duplicates are natural occurrences
  • User must direct what to look out for
  • Most data-cleaning work in XML focuses on duplicates

9
Duplicates
[Figure: two account subtrees with the same Branch_Code (4) and Type (Checking); duplicates are highlighted in green and red]
  • Green: Duplicate transactions may be a natural
    occurrence
  • Red: Duplicate accounts may be illegal

10
Attribute Outliers
  • Deviating Patterns
  • A univariate point that exhibits deviating
    correlation behavior with respect to other
    attributes
  • May indicate
  • Discrepancies, errors
  • Suspicious activity

11
Attribute Outliers
[Figure: two account subtrees (Checking and Savings, Branch_Code 4), each with a Transactions element containing Transaction elements]
  • Left: Transaction with deviating amount
  • Right: Transaction with deviating destination
    Bank

12
Detecting Artifacts
  • Naïve Approach
  • Map the XML data to relational tables and apply
    detection techniques for relational data
  • Problems
  • Handling of missing attribute values
  • Hierarchical encoding is lost
  • Direct methods that work on XML should be used

13
DeepDetect
  • Extensible system for detecting artifacts
  • Provides common facilities for various artifact
    detection algorithms
  • Consists of modules that have different
    components
  • Currently detects
  • Duplicates
  • Attribute Outliers

14
DeepDetect General Flow
  • Load XML document
  • Specify discretization rules
  • Configure artifact detection algorithms
  • Run detection algorithms
  • Inspect results and make corrections
  • Export XML

15
GUI Module
  • Obtains user input for
  • Discretization Rules
  • Detection Parameters
  • Corrections
  • Presents results of detection algorithms

[Diagram: the four modules: GUI, Specification, Data Preparation, Artifact Detection]
16
Specification Module
  • Computes the discretization for values
  • Makes the corrections specified by the user on
    export
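As a rough illustration of what a discretization rule might compute, the sketch below maps numeric values to bin indices. The `discretize` helper and the cut points are hypothetical; the slides do not say how DeepDetect represents its rules.

```python
from bisect import bisect_right

def discretize(value, boundaries):
    """Return the index of the bin that `value` falls into.

    `boundaries` is a sorted list of cut points, so [50, 100] defines
    the bins (-inf, 50), [50, 100) and [100, +inf).
    """
    return bisect_right(boundaries, value)

amt_bins = [50, 100]  # hypothetical rule for the Amt element
labels = [discretize(v, amt_bins) for v in (30, 50, 99.5, 250)]
print(labels)
```

Binning numeric values this way lets the later counting steps treat amounts as a small set of categories instead of distinct numbers.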

17
Data Preparation Module
  • Formats the XML document to a more convenient
    form
  • Extracts the structure tree of an XML document

[Figure: structure tree. Accounts > Account () > {Transactions, Branch_Code, Type}; Transactions > Transaction () > {Amt, Type, Branch}]
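One way to picture a structure tree is as the set of distinct tag paths in a document, with repeated elements collapsed into a single node. The sketch below takes that reading; `structure_paths` and the sample document are assumptions, not the DeepDetect implementation.

```python
import xml.etree.ElementTree as ET

def structure_paths(root):
    """Collect the distinct tag paths of a document in first-seen order."""
    paths = []
    seen = set()

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        if path not in seen:  # repeated elements collapse into one node
            seen.add(path)
            paths.append(path)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return paths

# Hypothetical document with two accounts and three transactions.
DOC = (
    "<Accounts>"
    "<Account><Branch_Code>4</Branch_Code><Type>Checking</Type>"
    "<Transactions>"
    "<Transaction><Amt>30</Amt><Type>C</Type><Bank>YZ</Bank></Transaction>"
    "<Transaction><Amt>50</Amt><Type>C</Type></Transaction>"
    "</Transactions></Account>"
    "<Account><Branch_Code>4</Branch_Code><Type>Savings</Type>"
    "<Transactions>"
    "<Transaction><Amt>70</Amt><Type>D</Type><Bank>AB</Bank></Transaction>"
    "</Transactions></Account>"
    "</Accounts>"
)

paths = structure_paths(ET.fromstring(DOC))
for p in paths:
    print(p)
```

Although the document holds two accounts and three transactions, each path appears only once, and optional elements such as `Bank` still show up because at least one instance contains them.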
18
Data Preparation Module
  • Index XML Tree
  • Similar to depth first order
  • Store depth

[Figure: indexed XML tree; each node is labelled (start, end), and leaf values are shown after the colon]
  • Accounts (1, 14)
  • Account (2, 13)
  • Transactions (3, 12)
  • Transaction (4, 11)
  • Amt (5, 6): 30
  • Type (7, 8): C
  • Bank (9, 10): YZ
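The (start, end) labels in the figure are consistent with a counter that is incremented once on entering and once on leaving each element during a depth-first traversal. The sketch below reproduces the numbers under that assumption; `index_tree`, `is_ancestor`, and the explicit `depth` field are illustrative names, not DeepDetect's.

```python
import xml.etree.ElementTree as ET

def index_tree(root):
    """Assign start, end and depth to every element in depth-first order."""
    counter = 0
    entries = []

    def visit(elem, depth):
        nonlocal counter
        counter += 1  # number assigned on entering the element
        entry = {"tag": elem.tag, "start": counter, "end": None, "depth": depth}
        entries.append(entry)
        for child in elem:
            visit(child, depth + 1)
        counter += 1  # number assigned on leaving the element
        entry["end"] = counter

    visit(root, 0)
    return entries

def is_ancestor(a, b):
    """True when a's interval strictly contains b's interval."""
    return a["start"] < b["start"] and b["end"] < a["end"]

DOC = (
    "<Accounts><Account><Transactions><Transaction>"
    "<Amt>30</Amt><Type>C</Type><Bank>YZ</Bank>"
    "</Transaction></Transactions></Account></Accounts>"
)

index = index_tree(ET.fromstring(DOC))
for e in index:
    print(e["tag"], (e["start"], e["end"]), "depth", e["depth"])
```

With such intervals, an ancestor test reduces to two integer comparisons, which is presumably why a detection algorithm would want the index instead of walking the tree again.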
19
Artifact Detection Module
  • Contains the various detection algorithms
  • Duplicate
  • Attribute Outlier
  • Detection algorithms may make use of structure
    tree and indexed XML tree

20
Duplicate Detection
  • Input: User-defined entities, XML data
  • Extract candidates and tuples
  • Tuple: (Amt, 50)
  • Candidate: Set of tuples
  • Score pair-wise tuple similarity
  • E.g. using edit distance
  • Classify duplicate candidates
  • Output: Duplicate clusters
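The scoring and classification steps might look like the sketch below, which uses Levenshtein edit distance on tuple values. Normalising the distance, averaging over shared element names, and the 0.8 threshold are all assumptions chosen for illustration; the slides only say that edit distance is one possible similarity measure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def candidate_similarity(c1, c2):
    """Average value similarity over the element names both candidates share."""
    shared = sorted(set(c1) & set(c2))
    if not shared:
        return 0.0
    return sum(similarity(c1[n], c2[n]) for n in shared) / len(shared)

THRESHOLD = 0.8  # hypothetical cut-off for calling two candidates duplicates

# Each candidate is a set of (name, value) tuples, stored here as a dict.
a = {"Amt": "50", "Type": "C", "Bank": "YZ"}
b = {"Amt": "50", "Type": "C", "Bank": "YZ"}
c = {"Amt": "500", "Type": "D", "Bank": "AB"}

print(candidate_similarity(a, b) >= THRESHOLD)
print(candidate_similarity(a, c) >= THRESHOLD)
```

Candidates whose pairwise score clears the threshold would then be grouped into clusters, for example by taking connected components of the "is duplicate of" relation.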

21
Attribute Outlier Detection
[Figure: XML account tree; each account has Branch_Code, Type, and a Transactions element containing Transaction elements]
  • XML Hierarchy can be used to partition objects
    into subspaces in which to detect outliers
  • Uncovers outliers more relevant to their locality

22
Attribute Outlier Detection
  • Input: XML data, parameters
  • Generate Subspaces
  • Count Supports of projections on subspace
  • Score each potential outlier
  • Identify Outliers by computing thresholds per
    subspace
  • Output: Outliers

23
Attribute Outlier Detection
  • Support for projection Amt/ 100, Type/C,
    Bank/YZ is 1
  • Support for projection Type/C, Bank/YZ is 3
  • Outlier score for Amt w.r.t. Amt/ 100,
    Type/C, Bank/YZ is 1 / 3, i.e. the support of
    the full projection divided by the support of
    the projection without Amt
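The arithmetic on this slide can be reproduced with a short sketch. It assumes the score is the support of the full projection divided by the support of the projection with the target attribute removed, which matches the numbers shown. The sample objects and the bin labels "high" and "low" are invented, since the Amt bin is not legible in the transcript.

```python
def support(objects, projection):
    """Count the objects that agree with every attribute/value in `projection`."""
    return sum(
        1 for obj in objects
        if all(obj.get(attr) == val for attr, val in projection.items())
    )

def outlier_score(objects, projection, target):
    """Support of the full projection over support without `target`."""
    rest = {attr: val for attr, val in projection.items() if attr != target}
    denominator = support(objects, rest)
    return support(objects, projection) / denominator if denominator else 0.0

# Hypothetical discretized transactions within one subspace.
transactions = [
    {"Amt": "high", "Type": "C", "Bank": "YZ"},
    {"Amt": "low", "Type": "C", "Bank": "YZ"},
    {"Amt": "low", "Type": "C", "Bank": "YZ"},
    {"Amt": "low", "Type": "D", "Bank": "AB"},
]

full = {"Amt": "high", "Type": "C", "Bank": "YZ"}
score = outlier_score(transactions, full, target="Amt")
print(support(transactions, full), score)
```

How the score is compared with the per-subspace threshold is not stated in the slides, so the sketch stops at the score itself.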

24
Implementation
[Diagram: XML input parsed by SAX, with events passed through chained handlers: Handler 1, Handler 2a, Handler 2b]
  • SAX used instead of DOM
  • SAX event chaining
  • XML is further split into different element files
    and indexed by depth first traversal
  • No need to re-parse redundant elements using
    SAX
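SAX event chaining can be sketched with Python's standard `xml.sax` module: a first handler receives every event, does its own bookkeeping, and forwards the event to downstream handlers, so the document is parsed only once. The class names and what each handler computes are assumptions for illustration; the slides do not describe what Handler 1, 2a, and 2b do.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ForwardingHandler(ContentHandler):
    """First handler in the chain: tracks depth, then forwards each event."""
    def __init__(self, downstream):
        super().__init__()
        self.downstream = list(downstream)
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        for handler in self.downstream:
            handler.startElement(name, attrs)

    def endElement(self, name):
        self.depth -= 1
        for handler in self.downstream:
            handler.endElement(name)

    def characters(self, content):
        for handler in self.downstream:
            handler.characters(content)

class TagCounter(ContentHandler):
    """Downstream handler: counts how often each element name occurs."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

class TextCollector(ContentHandler):
    """Downstream handler: collects the text of one element name."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.values = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == self.tag:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self._buffer is not None:
            self.values.append("".join(self._buffer))
            self._buffer = None

DOC = (b"<Transactions>"
       b"<Transaction><Amt>30</Amt><Type>C</Type></Transaction>"
       b"<Transaction><Amt>50</Amt><Type>C</Type></Transaction>"
       b"</Transactions>")

counter = TagCounter()
amounts = TextCollector("Amt")
head = ForwardingHandler([counter, amounts])
xml.sax.parseString(DOC, head)  # single pass over the document
print(head.max_depth, counter.counts, amounts.values)
```

Because SAX streams events instead of building a tree in memory, this arrangement keeps memory use roughly independent of document size, which is a common reason to prefer it over DOM.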

25
Features
  • Set discretization bins

26
Configure Duplicate Detection
27
Configure Attribute Outlier Detection
28
Viewing Duplicate Clusters
29
Viewing Attribute Outliers
30
Conclusion
  • Detecting artifacts in XML data requires
    specialized techniques
  • The DeepDetect architecture provides common
    facilities for different detection algorithms
  • Future Work
  • Better visualization of artifacts for user
    inspection
  • The use of Aggregates to summarize relevant nodes
    at higher levels of abstraction for artifact
    detection

31
End
  • Q&A