The PADS System version 2.0

Provided by: DPW9
1
The PADS System (version 2.0)
[Diagram: system pipeline. Boxes: Raw Data; Format Inference (Tokenization, Structure Discovery, Format Refinement, Scoring Function); Data Description; PADS Compiler; XMLifier; XML; Profiler; Analysis Report.]
2
Tokenization
  • Parse strings; convert to symbolic tokens
  • Basic token set skewed towards systems data
  • A config file allows users to define their own
    new token types via regular expressions

Example (tokenize):
  0, 24     ->  INT , INT
  bar, end  ->  STR , STR
  foo, 16   ->  STR , INT
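The tokenization step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names INT and STR come from the slide, while `TOKEN_DEFS`, `compile_tokenizer`, `tokenize`, and the regular expressions are hypothetical stand-ins for the PADS config file, whose real syntax is not shown here.

```python
import re

# Hypothetical built-in token set, in priority order (not the real PADS set).
TOKEN_DEFS = [
    ("INT", r"[0-9]+"),
    ("STR", r"[A-Za-z]+"),
    ("WS", r"\s+"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]

def compile_tokenizer(token_defs):
    """Build one regex with a named group per token type."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in token_defs)
    return re.compile(pattern)

def tokenize(line, tokenizer):
    """Convert a raw string into a list of symbolic tokens."""
    tokens = []
    for match in tokenizer.finditer(line):
        kind = match.lastgroup
        if kind == "WS":
            continue  # whitespace is dropped in this sketch
        # Punctuation stands for itself; other tokens are named by type.
        tokens.append(match.group() if kind == "PUNCT" else kind)
    return tokens

default_tokenizer = compile_tokenizer(TOKEN_DEFS)
print(tokenize("0, 24", default_tokenizer))  # ['INT', ',', 'INT']

# A user-defined token type, placed first so it wins over INT.
custom_defs = [("IP", r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")] + TOKEN_DEFS
custom_tokenizer = compile_tokenizer(custom_defs)
print(tokenize("10.0.0.1, up", custom_tokenizer))  # ['IP', ',', 'STR']
```

Putting user-defined types ahead of the built-in ones is a design choice of this sketch: with regex alternation the first matching alternative wins, so a more specific token type has to come first.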
3
Structure Discovery Overview
  • Top-down, divide-and-conquer algorithm
  • Compute various statistics from tokenized data
  • Guess a top-level type constructor
  • Partition tokenized data into smaller chunks
  • Recursively analyze and compute types from
    smaller chunks
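A toy version of the top-down recursion is sketched below. It is not the PADS algorithm: the real system guesses the type constructor from token histograms, whereas this sketch swaps in a much simpler rule (a token that occurs exactly once in every chunk becomes a struct separator, otherwise the distinct chunk shapes form a union). The names `discover` and `sequence_type` and the tuple encoding of types are assumptions made for illustration.

```python
def sequence_type(tokens):
    """Describe one fixed token sequence."""
    if len(tokens) == 0:
        return ("empty",)
    if len(tokens) == 1:
        return ("base", tokens[0])
    return ("struct", tuple(("base", t) for t in tokens))

def discover(chunks):
    """chunks: non-empty list of token lists. Returns a nested-tuple type."""
    shapes = sorted({tuple(c) for c in chunks})
    if len(shapes) == 1:
        # Base case: every chunk looks the same.
        return sequence_type(shapes[0])
    # Guess a struct: find a token occurring exactly once in every chunk.
    for tok in chunks[0]:
        if all(c.count(tok) == 1 for c in chunks):
            # Partition each chunk around the separator and recurse.
            left = [c[:c.index(tok)] for c in chunks]
            right = [c[c.index(tok) + 1:] for c in chunks]
            return ("struct", (discover(left), ("base", tok), discover(right)))
    # Fallback guess: a union over the distinct shapes seen.
    return ("union", tuple(sequence_type(s) for s in shapes))

sources = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
print(discover(sources))
```

On the three tokenized sources from the tokenization slide, the sketch returns a struct whose two fields are each a union of INT and STR, separated by a comma. Each recursive call works on strictly shorter chunks, so the recursion terminates.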

4
Structure Discovery Overview
[Figure: one discover step on the tokenized sources]
candidate structure so far: struct { ? , ? }
sources        first ?   second ?
INT , INT      INT       INT
STR , STR      STR       STR
STR , INT      STR       INT
5
Structure Discovery Overview
[Figure: the discover step applied again, one level down. The struct found so far has the fields ? , ?; an unknown field is replaced by a union whose branches are INT and STR.]
6
Structure Discovery Details
  • Compute frequency distribution histogram for each
    token.
  • (And recompute at every level of recursion).

[Figure: one histogram per token over the sources INT , INT; STR , STR; STR , INT. x-axis: number of occurrences per source; y-axis: percentage of sources.]
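The histogram described above can be computed as follows. This is a minimal sketch, assuming each source is already a list of tokens; the function name `token_histograms` and the dictionary layout are hypothetical.

```python
from collections import Counter

def token_histograms(chunks):
    """Map each token to {occurrences per source: fraction of sources}."""
    n = len(chunks)
    vocabulary = {tok for chunk in chunks for tok in chunk}
    histograms = {}
    for tok in sorted(vocabulary):
        # How many sources contain the token 0 times, 1 time, 2 times, ...
        per_chunk = Counter(chunk.count(tok) for chunk in chunks)
        histograms[tok] = {k: per_chunk[k] / n for k in sorted(per_chunk)}
    return histograms

sources = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
for token, histogram in token_histograms(sources).items():
    print(token, histogram)
```

For the three sources shown, the comma occurs exactly once in every source, while INT and STR are each spread evenly over zero, one, and two occurrences. In a recursive setting the same function would simply be called again on the smaller chunks.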
7
Structure Discovery Details
  • Cluster tokens into groups with similar
    histograms
  • Similar histograms are strong evidence that
    tokens coexist in the same description component
      • use symmetric relative entropy to measure
        similarity
  • Only the shape of the histogram matters
      • normalize histograms by sorting columns in
        descending order of size
  • Result: comma and quote are grouped together
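The similarity measure can be sketched as below. The slide names the measure but gives no formula, so the averaged form used here (each histogram compared against the midpoint of the two) is an assumption; it is one common way to symmetrize relative entropy. The function names and the zero-padding of shorter histograms are also assumptions.

```python
import math

def normalize(histogram, width):
    """Sort column heights in descending order and pad with zeros."""
    columns = sorted(histogram.values(), reverse=True)
    return columns + [0.0] * (width - len(columns))

def relative_entropy(p, q):
    """Relative entropy of p with respect to q, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(h1, h2):
    """Assumed form: average relative entropy to the midpoint histogram."""
    width = max(len(h1), len(h2))
    p, q = normalize(h1, width), normalize(h2, width)
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)

comma = {1: 1.0}   # once in every source
quote = {2: 1.0}   # twice in every source
ints = {0: 1/3, 1: 1/3, 2: 1/3}
print(symmetric_relative_entropy(comma, quote))  # 0.0
print(symmetric_relative_entropy(comma, ints))
```

Because normalization discards which occurrence count a column belongs to, a token seen once per source and a token seen twice per source get identical shapes and a distance of zero, which is how a comma and a quote can end up in the same group.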

8
Structure Discovery Details
  • Find most promising token group to divide and
    conquer
      • Structs: groups with high coverage and low
        residual mass
      • Arrays: groups with high coverage, sufficient
        width, and high residual mass
      • Unions: other token groups
  • Struct involving comma and quote identified in
    histogram above
  • Overall procedure gives a good starting point for
    the rewriting system
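The three-way choice above might look like the sketch below for a single token. Everything numeric here is an assumption: the slide gives no definitions or thresholds, so coverage is taken as the fraction of sources containing the token, residual mass as the fraction of sources that do not fall in the tallest non-zero column, and width as the number of distinct non-zero occurrence counts. The function name `classify` and its default thresholds are hypothetical.

```python
from collections import Counter

def classify(counts, min_coverage=0.9, max_residual=0.1, min_width=3):
    """counts: occurrences of one token in each source.

    Returns "struct", "array", or "union" using assumed thresholds.
    """
    n = len(counts)
    nonzero = Counter(c for c in counts if c > 0)
    if not nonzero:
        return "union"
    coverage = sum(nonzero.values()) / n      # sources containing the token
    residual = 1 - max(nonzero.values()) / n  # mass outside the tallest column
    width = len(nonzero)                      # distinct non-zero counts
    if coverage >= min_coverage and residual <= max_residual:
        return "struct"
    if coverage >= min_coverage and width >= min_width:
        return "array"
    return "union"

print(classify([1, 1, 1, 1]))  # struct
print(classify([3, 5, 2, 7]))  # array
print(classify([1, 0, 0, 2]))  # union
```

A token that appears the same number of times in nearly every source suggests a fixed field layout, while a token that appears everywhere but a varying number of times suggests a repeated element.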

9
Format Refinement
  • Reanalyze example data with the aid of the rough
    description
  • Rewrite format description to
      • simplify presentation
          • merge and rewrite structures
      • improve precision
          • reorganize description structure
          • add constraints (sortedness, uniqueness, linear
            relations, functional dependencies)
      • fill in missing details
          • find completions where structure discovery
            bottoms out
          • refine base types (termination conditions for
            strings, integer sizes)

10
Format Refinement
  • Three main sub-phases
  • Phase 1: Tagging/table generation
      • Convert rough description into tagged description
        and relational table
  • Phase 2: Constraint inference
      • Analyze table and infer constraints
      • Find functional dependencies
  • Phase 3: Format rewriting
      • Use inferred constraints and type isomorphisms to
        rewrite rough description
      • Greedy search to optimize information-theoretic
        score
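The greedy search in Phase 3 can be illustrated with a generic loop. This is a sketch only: the actual rewrite rules and the information-theoretic score are not given on the slide, so `greedy_refine` takes both as parameters, and the toy rules (`dedupe_union`, `unwrap_singleton`) and the node-count score (`size`) are hypothetical stand-ins.

```python
def greedy_refine(description, rewrite_rules, score):
    """Repeatedly apply the rewrite that lowers the score the most."""
    current, current_score = description, score(description)
    while True:
        candidates = [rule(current) for rule in rewrite_rules]
        candidates = [c for c in candidates if c is not None]
        if not candidates:
            return current
        best = min(candidates, key=score)
        if score(best) >= current_score:
            return current  # no rewrite improves the score
        current, current_score = best, score(best)

# Toy rules over nested-tuple descriptions; a rule returns None if it
# does not apply.
def dedupe_union(d):
    if d[0] == "union" and len(set(d[1])) < len(d[1]):
        return ("union", tuple(dict.fromkeys(d[1])))
    return None

def unwrap_singleton(d):
    if d[0] == "union" and len(d[1]) == 1:
        return d[1][0]
    return None

def size(d):
    """Stand-in score: number of nodes in the description."""
    if d[0] in ("union", "struct"):
        return 1 + sum(size(child) for child in d[1])
    return 1

rough = ("union", (("base", "INT"), ("base", "INT")))
print(greedy_refine(rough, [dedupe_union, unwrap_singleton], size))
```

The loop stops as soon as no rule applies or no candidate scores strictly better than the current description, so it always terminates with a local optimum rather than a guaranteed global one.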

11
Refinement: Simple Example
12
Sample data:
  0, 24
  foo, beg
  bar, end
  0, 56
  baz, middle
  0, 12
  0, 33
13
[Figure: structure discovery applied to the sample data (0, 24; foo, beg; bar, end; 0, 56; baz, middle; 0, 12; 0, 33) yields]
  struct { union { int | alpha } , union { int | alpha } }
14
[Figure: structure discovery, then tagging/table generation, on the sample data]
Tagged description (the two unions are tagged id1 and id2):
  struct { union { int (id3) | alpha (id4) } , union { int (id5) | alpha (id6) } }
Table:
  id1  id2  id3  id4  id5  id6
  1    1    0    --   --   24
  2    2    --   foo  beg  --
  ...  ...  ...  ...  ...  ...
15
[Figure: the sample data, tagged description, and table from the tagging/table generation step, followed by constraint inference, which yields]
  id3 = 0; id1 = id2 (first union is int
  whenever second union is int)
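Constraint inference over the relational table can be sketched as three simple scans. The table below is a hypothetical encoding of the first few sample records: `branch1` and `branch2` play the role of the union tags id1 and id2, `int1` and `int2` hold the integer values of the two fields, and `None` stands for the "--" cells. The function names are assumptions, and real inference would cover more constraint kinds than these.

```python
from itertools import combinations, permutations

def constant_columns(rows, columns):
    """Columns whose present (non-None) values are all the same."""
    result = {}
    for col in columns:
        values = {row[col] for row in rows if row[col] is not None}
        if len(values) == 1:
            result[col] = values.pop()
    return result

def equal_columns(rows, columns):
    """Pairs of columns that agree on every row."""
    return [(a, b) for a, b in combinations(columns, 2)
            if all(row[a] == row[b] for row in rows)]

def functional_dependencies(rows, columns):
    """Pairs (a, b) where the value of a determines the value of b."""
    found = []
    for a, b in permutations(columns, 2):
        seen = {}
        # setdefault records the first b seen for each a, then compares.
        if all(seen.setdefault(row[a], row[b]) == row[b] for row in rows):
            found.append((a, b))
    return found

columns = ["branch1", "branch2", "int1", "int2"]
rows = [
    {"branch1": 1, "branch2": 1, "int1": 0, "int2": 24},        # 0, 24
    {"branch1": 2, "branch2": 2, "int1": None, "int2": None},   # foo, beg
    {"branch1": 2, "branch2": 2, "int1": None, "int2": None},   # bar, end
    {"branch1": 1, "branch2": 1, "int1": 0, "int2": 56},        # 0, 56
]
print(constant_columns(rows, columns))  # {'int1': 0}
print(equal_columns(rows, columns))     # [('branch1', 'branch2')]
```

The two printed results correspond to the constraints on the slide: the first integer is always 0, and both unions always take the same branch.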
16
[Figure: structure discovery, tagging/table generation, and constraint inference on the sample data, followed by rule-based structure rewriting]
Constraints: id3 = 0; id1 = id2 (first union is int
whenever second union is int)
Rewritten description:
  union { struct { 0 , int } | struct { str , str } }
More accurate: the first int is always 0, and "int, alpha-string" records are ruled out.
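One rewriting rule from this example, turning a struct of two correlated unions into a union of structs, can be sketched as follows. The nested-tuple encoding and the name `merge_correlated_unions` are hypothetical, and the rule is only valid under the inferred constraint that both unions always take the same branch.

```python
def merge_correlated_unions(description):
    """Rewrite struct { union A , union B } into a union of structs.

    Assumes branch i of the first union always co-occurs with branch i
    of the second (the id1 = id2 constraint in the example).
    """
    _, (first, separator, second) = description
    branches = zip(first[1], second[1])
    return ("union", tuple(("struct", (b1, separator, b2))
                           for b1, b2 in branches))

int_or_str = ("union", (("base", "int"), ("base", "str")))
rough = ("struct", (int_or_str, ("base", ","), int_or_str))
print(merge_correlated_unions(rough))
```

The rewritten type accepts only "int, int" and "str, str" records, so mixed records that the rough description allowed are ruled out. Replacing the first int with the constant 0 would be a second, separate rewrite.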
17
Incomprehensible Type Theory Section of The Talk
18
Evaluation!
19
Benchmark Formats
20
Execution Times
21
Training Time
22
Training Accuracy
23
Type Complexity and Min. Training Size
24
Biggest Weakness
  • Degree of success often hinges on the inference
    system having a tokenization scheme that matches
    the tokenization scheme of the data source.
  • Good tokens capture high-level, human
    abstractions compactly.
  • Techniques for learning tokenizations from data
    directly?
  • Techniques for using multiple, ambiguous
    tokenization schemes simultaneously?
  • Qian Xi is looking at these problems with Kenny
    I.

25
Related Work
  • Most common domains for grammar inference
      • XML/HTML
      • natural language
  • Systems that focus on ad hoc data are rare, and
    those that do don't support the PADS tool suite
      • Rufus system '93, TSIMMIS '94, Potter's Wheel '01
  • Top-down structure discovery
      • Arasu & Garcia-Molina '03 (extracting data from
        web pages)
  • Grammar induction using MDL and grammar rewriting
    search
      • Stolcke and Omohundro '94, "Inducing probabilistic
        grammars..."
      • T. W. Hong '02, Ph.D. thesis on information
        extraction from web pages
      • Higuera '01, "Current trends in grammar induction"

26
Conclusions
  • Still a work in progress, but we are able to
    produce XML and statistical reports fully
    automatically from ad hoc data sources.
  • We've tested on approximately 15 real, mostly
    systemsy data sources (web logs, crash reports,
    AT&T phone call data, etc.) with what we believe
    is relatively good success.
  • For papers, online demos, and PADS software, see our
    website at

http://www.padsproj.org/
27
End
28
Normalized Information-theoretic Scores