The PADS System version 2.0 - PowerPoint PPT Presentation

Slides: 29

Provided by: DPW9

Transcript and Presenter's Notes

Title: The PADS System version 2.0

1
The PADS System (version 2.0)
Pipeline diagram: Raw Data -> Format Inference (Tokenization -> Structure Discovery -> Format Refinement, guided by a Scoring Function) -> Data Description -> PADS Compiler -> XMLifier (produces XML) and Profiler (produces Analysis Report)
2
Tokenization

Parse strings and convert them to symbolic tokens
Basic token set is skewed towards systems data
A config file allows users to define their own
new token types via regular expressions

Example (tokenize):
0, 24 -> INT , INT
bar, end -> STR , STR
foo, 16 -> STR , INT
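
The tokenization step can be sketched as follows. This is a minimal illustration, not the PADS tokenizer: the two-entry base token set, the `make_tokenizer` helper, and the way user-defined token types are passed in as (name, regex) pairs are all assumptions made for the example.

```python
import re

# Base token set (assumed for illustration; the real PADS set is larger).
BASE_TOKENS = [
    ("INT", r"[0-9]+"),
    ("STR", r"[A-Za-z]+"),
]

def make_tokenizer(user_tokens=()):
    """Build a tokenizer; user-defined (name, regex) pairs are tried first."""
    specs = list(user_tokens) + BASE_TOKENS
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in specs))

    def tokenize(line):
        tokens = []
        pos = 0
        while pos < len(line):
            if line[pos].isspace():
                pos += 1  # whitespace is skipped in this sketch
                continue
            m = pattern.match(line, pos)
            if m and m.end() > pos:
                tokens.append(m.lastgroup)  # name of the token type that matched
                pos = m.end()
            else:
                tokens.append(line[pos])  # any other character is a literal token
                pos += 1
        return tokens

    return tokenize

tokenize = make_tokenizer()
print(tokenize("0, 24"))     # ['INT', ',', 'INT']
print(tokenize("bar, end"))  # ['STR', ',', 'STR']
print(tokenize("foo, 16"))   # ['STR', ',', 'INT']
```

A user-defined type would be supplied the same way a config entry might be, for example `make_tokenizer([("IP", r"\d+\.\d+\.\d+\.\d+")])`, where `IP` is a hypothetical token name.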
3
Structure Discovery Overview

Top-down, divide-and-conquer algorithm
Compute various statistics from tokenized data
Guess a top-level type constructor
Partition tokenized data into smaller chunks
Recursively analyze and compute types from
smaller chunks
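
The recursion described above can be sketched in a few lines. This is a deliberately simplified stand-in: it picks a struct separator by checking for a token that occurs exactly once in every chunk, and otherwise forms a union over the distinct chunk shapes, whereas the real system chooses constructors from histogram statistics. The `discover` name and the nested-tuple type representation are assumptions.

```python
def discover(chunks):
    """chunks: non-empty list of token lists; returns a nested-tuple type."""
    first = chunks[0]
    # Base case: all chunks are identical, so no further splitting is needed.
    if all(c == first for c in chunks):
        if not first:
            return "EMPTY"
        return first[0] if len(first) == 1 else ("struct", *first)
    # Struct guess: a token occurring exactly once in every chunk splits each
    # chunk into a left part and a right part, which are analyzed recursively.
    for tok in first:
        if all(c.count(tok) == 1 for c in chunks):
            left = [c[:c.index(tok)] for c in chunks]
            right = [c[c.index(tok) + 1:] for c in chunks]
            return ("struct", discover(left), tok, discover(right))
    # Union guess: partition the chunks by their token sequence.
    groups = {}
    for c in chunks:
        groups.setdefault(tuple(c), []).append(c)
    return ("union", *(discover(g) for g in groups.values()))

sources = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
print(discover(sources))
# ('struct', ('union', 'INT', 'STR'), ',', ('union', 'INT', 'STR'))
```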

4
Structure Discovery Overview


Diagram: the candidate structure so far is struct { ? , ? }, discovered by splitting each source at the comma:
INT , INT -> INT | INT
STR , STR -> STR | STR
STR , INT -> STR | INT
5
Structure Discovery Overview


Diagram: the discover step recurses on a chunk of the candidate struct, replacing an unknown (?) with a union of INT and STR; the remaining unknowns are analyzed the same way.
6
Structure Discovery Details

Compute a frequency distribution histogram for each token
(and recompute at every level of recursion).

Figure: per-token histograms for the sources INT , INT / STR , STR / STR , INT
x-axis: number of occurrences per source; y-axis: percentage of sources
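
As a rough sketch of this step, the histogram for a token maps each occurrence count to the fraction of sources in which the token appears that many times. The `token_histograms` name and the dictionary representation are assumptions for illustration.

```python
from collections import Counter

def token_histograms(records):
    """records: list of token lists.
    Returns {token: {occurrence_count: fraction_of_records}}."""
    n = len(records)
    histograms = {}
    for tok in {t for rec in records for t in rec}:
        # How many records contain `tok` exactly k times, for each k seen.
        counts = Counter(rec.count(tok) for rec in records)
        histograms[tok] = {k: v / n for k, v in counts.items()}
    return histograms

sources = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
h = token_histograms(sources)
print(h[","])  # {1: 1.0}: the comma occurs exactly once in every source
```

In the recursive algorithm this function would be called again on each smaller set of chunks, since the distributions change at every level.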
7
Structure Discovery Details

Cluster tokens into groups with similar histograms
Similar histograms:
  strong evidence that tokens coexist in the same description component
  use symmetric relative entropy to measure similarity
Only the shape of the histogram matters:
  normalize histograms by sorting columns in descending size
  result: comma and quote are grouped together
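
A small sketch of the similarity measure follows. The slide does not give the formula, so this uses one common symmetric form of relative entropy (each histogram compared against the average of the two, which is the Jensen-Shannon divergence) as an assumption; the function names are also illustrative.

```python
import math

def normalize(hist, width):
    """Keep only the shape: sort column heights in descending order, pad with zeros."""
    cols = sorted(hist.values(), reverse=True)
    return cols + [0.0] * (width - len(cols))

def relative_entropy(p, q):
    """Relative entropy of p with respect to q; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(h1, h2):
    width = max(len(h1), len(h2))
    p, q = normalize(h1, width), normalize(h2, width)
    mid = [(a + b) / 2 for a, b in zip(p, q)]  # average distribution
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)

comma = {1: 1.0}                      # once in every record
quote = {2: 1.0}                      # twice in every record
integer = {0: 1/3, 1: 1/3, 2: 1/3}    # spread over several counts

print(symmetric_relative_entropy(comma, quote))        # 0.0, same shape
print(symmetric_relative_entropy(comma, integer) > 0)  # True, different shape
```

Because normalization discards which count each column belongs to, a token that always occurs once and a token that always occurs twice look identical, which is how comma and quote end up in the same group.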

8
Structure Discovery Details

Find the most promising token group to divide and conquer
Structs: groups with high coverage and low residual mass
Arrays: groups with high coverage, sufficient width, and high residual mass
Unions: other token groups
The struct involving comma and quote is identified in the histogram above
The overall procedure gives a good starting point for the rewriting system
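
The three-way decision can be sketched as a small classifier over a group's histogram. The definitions used here (coverage as the fraction of sources containing the token, width as the number of distinct nonzero counts, residual mass as everything outside the tallest column) and all three thresholds are assumptions chosen for illustration, not the values used by PADS.

```python
def classify_group(hist, min_coverage=0.9, max_residual=0.1, min_width=3):
    """hist: {occurrence_count: fraction_of_records} for a token group."""
    coverage = 1.0 - hist.get(0, 0.0)           # records that contain the token
    width = sum(1 for k in hist if k > 0)       # distinct nonzero counts
    residual = 1.0 - max(hist.values())         # mass outside the tallest column
    if coverage >= min_coverage and residual <= max_residual:
        return "struct"
    if coverage >= min_coverage and width >= min_width and residual > max_residual:
        return "array"
    return "union"

print(classify_group({1: 1.0}))                    # struct
print(classify_group({3: 0.4, 5: 0.3, 7: 0.3}))    # array
print(classify_group({0: 1/3, 1: 1/3, 2: 1/3}))    # union
```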

9
Format Refinement

Reanalyze the example data with the aid of the rough description
Rewrite the format description to:
  simplify presentation
    merge and rewrite structures
  improve precision
    reorganize description structure
    add constraints (sortedness, uniqueness, linear relations, functional dependencies)
  fill in missing details
    find completions where structure discovery bottoms out
    refine base types (termination conditions for strings, integer sizes)

10
Format Refinement

Three main sub-phases
Phase 1: Tagging/table generation
  Convert the rough description into a tagged description and a relational table
Phase 2: Constraint inference
  Analyze the table and infer constraints
  Find functional dependencies
Phase 3: Format rewriting
  Use inferred constraints and type isomorphisms to rewrite the rough description
  Greedy search to optimize an information-theoretic score
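
The greedy search in Phase 3 can be sketched as a loop that applies a rewrite rule only when it lowers the score. The score here is a toy stand-in (the number of nodes in the type), not the information-theoretic score used by the system, and the single rewrite rule and nested-tuple representation are assumptions for the example.

```python
def size(t):
    """Toy score: number of nodes in a nested-tuple type (smaller is better)."""
    if isinstance(t, tuple):
        return 1 + sum(size(c) for c in t[1:])
    return 1

def merge_duplicate_branches(t):
    """Rewrite rule: drop repeated union branches and unwrap one-branch unions."""
    if not isinstance(t, tuple):
        return t
    kids = [merge_duplicate_branches(c) for c in t[1:]]
    if t[0] == "union":
        unique = []
        for k in kids:
            if k not in unique:
                unique.append(k)
        if len(unique) == 1:
            return unique[0]
        kids = unique
    return (t[0], *kids)

def greedy_rewrite(t, rules, score):
    """Apply rules repeatedly, keeping a rewrite only if it improves the score."""
    best, best_score = t, score(t)
    improved = True
    while improved:
        improved = False
        for rule in rules:
            candidate = rule(best)
            candidate_score = score(candidate)
            if candidate_score < best_score:
                best, best_score, improved = candidate, candidate_score, True
    return best

rough = ("struct", ("union", "INT", "INT"), ",", ("union", "INT", "STR"))
print(greedy_rewrite(rough, [merge_duplicate_branches], size))
# ('struct', 'INT', ',', ('union', 'INT', 'STR'))
```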

11
Refinement: Simple Example
12
Example data (one record per line):
0, 24
foo, beg
bar, end
0, 56
baz, middle
0, 12
0, 33
13
Example data: 0, 24 / foo, beg / bar, end / 0, 56 / baz, middle / 0, 12 / 0, 33
Structure discovery yields: struct { union { int, alpha }, ",", union { int, alpha } }
14
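Example data: 0, 24 / foo, beg / bar, end / 0, 56 / baz, middle / 0, 12 / 0, 33
Structure discovery yields: struct { union { int, alpha }, ",", union { int, alpha } }
Tagging/table generation: the two unions are tagged id1 and id2, and their int and alpha branches are tagged id3 to id6
Table (-- marks a branch not taken; remaining rows are elided on the slide):
id1  id2  id3  id4  id5  id6
1    1    0    --   --   24
2    2    --   foo  beg  --
...  ...  ...  ...  ...  ...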
15
Structure discovery and tagging/table generation as on the previous slide (same tagged description and table)
Constraint inference on the table finds:
  id3 = 0
  id1 = id2 (first union is int whenever second union is int)
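
The table and the two constraints can be reproduced with a short sketch. The column layout follows the two rows shown on the slide (id1 and id2 are branch tags, id3 to id6 hold the field values); the `build_table` and `infer_constraints` helpers, and the restriction to constant-column and equal-column constraints, are assumptions made for illustration.

```python
MISSING = "--"  # marks a union branch that was not taken

RECORDS = ["0, 24", "foo, beg", "bar, end", "0, 56", "baz, middle", "0, 12", "0, 33"]

def build_table(records):
    """One row per record; branch tag 1 means int, 2 means alpha."""
    rows = []
    for rec in records:
        first, second = [field.strip() for field in rec.split(",")]
        rows.append({
            "id1": 1 if first.isdigit() else 2,
            "id2": 1 if second.isdigit() else 2,
            "id3": int(first) if first.isdigit() else MISSING,
            "id4": MISSING if first.isdigit() else first,
            "id5": MISSING if second.isdigit() else second,
            "id6": int(second) if second.isdigit() else MISSING,
        })
    return rows

def infer_constraints(table):
    cols = list(table[0])
    constraints = []
    # Constant columns: every value that is present is the same.
    for c in cols:
        values = {row[c] for row in table if row[c] != MISSING}
        if len(values) == 1:
            constraints.append(f"{c} = {values.pop()}")
    # Equal columns: a very simple kind of functional dependency.
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if all(row[a] == row[b] for row in table):
                constraints.append(f"{a} = {b}")
    return constraints

print(infer_constraints(build_table(RECORDS)))  # ['id3 = 0', 'id1 = id2']
```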
16
Structure discovery, tagging/table generation, and constraint inference as on the previous slides (branches now labeled int and str)
Constraint inference: id3 = 0; id1 = id2 (first union is int whenever second union is int)
Rule-based structure rewriting yields: union { struct { 0, ",", int }, struct { str, ",", str } }
More accurate: the first int is always 0, and int , alpha-string records are ruled out
17
Incomprehensible Type Theory Section of The Talk
18
Evaluation
!
19
Benchmark Formats
20
Execution Times
21
Training Time
22
Training Accuracy
23
Type Complexity and Min. Training Size
24
Biggest Weakness

Degree of success often hinges on the inference
system having a tokenization scheme that matches
the tokenization scheme of the data source.
Good tokens capture high-level, human
abstractions compactly.
Techniques for learning tokenizations from data
directly?
Techniques for using multiple, ambiguous
tokenization schemes simultaneously?
Qian Xi is looking at these problems with Kenny
I.

25
Related Work

Most common domains for grammar inference:
  XML/HTML
  natural language
Systems that focus on ad hoc data are rare, and those that do don't support the PADS tool suite
  Rufus system '93, TSIMMIS '94, Potter's Wheel '01
Top-down structure discovery
  Arasu & Garcia-Molina '03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search
  Stolcke and Omohundro '94, Inducing probabilistic grammars...
  T. W. Hong '02, Ph.D. thesis on information extraction from web pages
  Higuera '01, Current trends in grammar induction

26
Conclusions

Still a work in progress, but we are able to
produce XML and statistical reports fully
automatically from ad hoc data sources.
We've tested on approximately 15 real, mostly
systemsy data sources (web logs, crash reports,
AT&T phone call data, etc.) with what we believe
is relatively good success.
For papers, online demos, and PADS software, see our
website at

http://www.padsproj.org/
27
End
28
Normalized Information-theoretic Scores
