Title: The PADS System version 2.0
1 The PADS System (version 2.0)
[Figure: system architecture diagram. Labels: Raw Data; Format Inference (Tokenization, Structure Discovery, Format Refinement, Scoring Function); Data Description; PADS Compiler; XMLifier, producing XML; Profiler, producing Analysis Report.]
2 Tokenization
- Parse strings; convert to symbolic tokens
- Basic token set skewed towards systems data
- A config file allows users to define their own new token types via regular expressions
Example (tokenize):
0, 24     ->  INT , INT
bar, end  ->  STR , STR
foo, 16   ->  STR , INT
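The tokenization step can be sketched in Python as follows. This is a minimal illustration, not the system's tokenizer: the token set (INT, STR, whitespace, single-character punctuation), the regular expressions, and the names `TOKEN_SPECS` and `tokenize` are all assumptions made for the example.

```python
import re

# Hypothetical token set, tried in order. A user config could prepend
# extra (name, regex) pairs to define new token types.
TOKEN_SPECS = [
    ("INT", r"[0-9]+"),
    ("STR", r"[A-Za-z][A-Za-z0-9]*"),
    ("WS", r"\s+"),
    ("PUNCT", r"."),
]

def tokenize(line, specs=TOKEN_SPECS):
    """Convert one line of raw text into a list of symbolic token names."""
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in specs))
    tokens = []
    for match in pattern.finditer(line):
        kind = match.lastgroup
        if kind == "WS":
            continue  # whitespace is dropped in this sketch
        # Punctuation is kept literally so "," stays visible as a separator.
        tokens.append(match.group() if kind == "PUNCT" else kind)
    return tokens

print(tokenize("0, 24"))     # ['INT', ',', 'INT']
print(tokenize("bar, end"))  # ['STR', ',', 'STR']
print(tokenize("foo, 16"))   # ['STR', ',', 'INT']

# A user-defined token type (hypothetical "IP") added via a regular expression.
custom = [("IP", r"\d+\.\d+\.\d+\.\d+")] + TOKEN_SPECS
print(tokenize("10.0.0.1, up", custom))  # ['IP', ',', 'STR']
```

Because the alternatives are tried in order, a user-defined type placed first takes priority over the basic tokens.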
3 Structure Discovery Overview
- Top-down, divide-and-conquer algorithm
- Compute various statistics from tokenized data
- Guess a top-level type constructor
- Partition tokenized data into smaller chunks
- Recursively analyze and compute types from
smaller chunks
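The recursion outlined above can be sketched in Python. This is a simplified illustration under stated assumptions: the real system chooses a constructor from histogram statistics, whereas this sketch substitutes two simple rules (split on a token that occurs exactly once in every chunk, otherwise partition by first token). The function name `discover` and the tuple encoding of types are hypothetical.

```python
from collections import Counter

def discover(chunks):
    """Infer a rough type from a non-empty list of token sequences."""
    distinct = {tuple(c) for c in chunks}
    # Base case: every chunk is identical.
    if len(distinct) == 1:
        only = next(iter(distinct))
        if len(only) == 0:
            return "EMPTY"
        if len(only) == 1:
            return only[0]
        return ("struct",) + only
    # Guess "struct": find a token occurring exactly once in every chunk.
    candidates = set(chunks[0])
    for chunk in chunks:
        counts = Counter(chunk)
        candidates = {t for t in candidates if counts[t] == 1}
    if candidates:
        sep = min(candidates, key=chunks[0].index)
        # Partition each chunk around the separator and recurse on both sides.
        left = [c[:c.index(sep)] for c in chunks]
        right = [c[c.index(sep) + 1:] for c in chunks]
        return ("struct", discover(left), sep, discover(right))
    # Guess "union": partition the chunks by their first token.
    groups = {}
    for chunk in chunks:
        groups.setdefault(chunk[0] if chunk else "EMPTY", []).append(chunk)
    if len(groups) == 1:
        return ("unknown", sorted(distinct))  # discovery bottoms out here
    return ("union",) + tuple(discover(g) for g in groups.values())

sources = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
print(discover(sources))
# ('struct', ('union', 'INT', 'STR'), ',', ('union', 'INT', 'STR'))
```

Each recursive call works on strictly smaller input (shorter chunks or fewer chunks), so the sketch always terminates.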
4 Structure Discovery Overview (continued)
[Figure: candidate structure so far. The discover step guesses a top-level struct split at the "," token, with two fields still unknown (?). The sources INT , INT / STR , STR / STR , INT are partitioned into left chunks INT, STR, STR and right chunks INT, STR, INT.]
5 Structure Discovery Overview (continued)
[Figure: a further discover step. The struct split at "," is kept; one of its unknown fields is refined into a union of INT and STR, and the remaining chunks are still to be analyzed (?).]
6 Structure Discovery Details
- Compute a frequency distribution histogram for each token.
  - (And recompute at every level of recursion.)
[Figure: per-token histograms for the sources INT , INT / STR , STR / STR , INT. x-axis: number of occurrences per source; y-axis: percentage of sources.]
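A minimal Python sketch of the histogram computation, assuming each source is already a list of token names. The function name `token_histograms` and the dictionary layout are illustrative choices, not taken from the system.

```python
from collections import Counter

def token_histograms(sources):
    """Map each token to {occurrences per source: fraction of sources}."""
    tokens = {tok for src in sources for tok in src}
    n = len(sources)
    histograms = {}
    for tok in tokens:
        # Count how many sources contain the token 0 times, 1 time, 2 times, ...
        per_source = Counter(src.count(tok) for src in sources)
        histograms[tok] = {k: v / n for k, v in sorted(per_source.items())}
    return histograms

sources = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
hists = token_histograms(sources)
print(hists[","])  # {1: 1.0}: the comma occurs exactly once in every source
```

In this example INT and STR each occur 0, 1, or 2 times in a third of the sources, so their histograms are spread out, while the comma's histogram is a single column.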
7 Structure Discovery Details
- Cluster tokens into groups with similar histograms
- Similar histograms
  - strong evidence tokens coexist in same description component
  - use symmetric relative entropy to measure similarity
- Only the shape of the histogram matters
  - normalize histograms by sorting columns in descending size
  - result: comma and quote grouped together
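The similarity measure can be sketched as follows. The exact formula used by the system is not given on the slide; this sketch assumes one common symmetric form, in which each histogram is compared against the midpoint of the two (the Jensen-Shannon divergence). It also assumes that all columns, including the zero-occurrence column, are sorted together. The function names are illustrative.

```python
import math

def normalize(hist):
    """Keep only the shape: column heights sorted in descending order."""
    return sorted(hist.values(), reverse=True)

def relative_entropy(p, q):
    """Sum of p_i * log2(p_i / q_i); requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(h1, h2):
    """Compare two histograms by shape; 0.0 means identical shapes."""
    p, q = normalize(h1), normalize(h2)
    width = max(len(p), len(q))
    p += [0.0] * (width - len(p))  # pad the narrower histogram with empty columns
    q += [0.0] * (width - len(q))
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)

comma = {1: 1.0}          # once in every source
quote = {2: 1.0}          # twice in every source
word = {0: 0.5, 1: 0.5}   # present in half of the sources
print(symmetric_relative_entropy(comma, quote))     # 0.0, same shape
print(symmetric_relative_entropy(comma, word) > 0)  # True, different shapes
```

Because normalization discards the occurrence counts themselves, a token that appears once per source and one that appears twice per source have the same shape and land in the same group.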
8 Structure Discovery Details
- Find the most promising token group to divide and conquer
  - Structs: groups with high coverage and low residual mass
  - Arrays: groups with high coverage, sufficient width, and high residual mass
  - Unions: other token groups
- Struct involving comma, quote identified in histogram above
- Overall procedure gives a good starting point for the rewriting system
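The selection rule above can be sketched in Python for a single token's histogram. The slide does not define coverage, width, or residual mass precisely, so the definitions below are assumptions: coverage is the fraction of sources containing the token at least once, width is the number of distinct non-zero occurrence counts, and residual mass is the fraction of sources outside the tallest non-zero column. The thresholds and the name `classify` are hypothetical.

```python
def classify(hist, min_coverage=0.9, max_residual=0.1, min_width=3):
    """Suggest a type constructor from one token's occurrence histogram.

    hist maps "occurrences per source" to the fraction of sources.
    All three thresholds are hypothetical defaults for this sketch.
    """
    present = {k: v for k, v in hist.items() if k > 0 and v > 0}
    coverage = sum(present.values())              # sources containing the token
    width = len(present)                          # distinct non-zero counts seen
    tallest = max(present.values(), default=0.0)
    residual = 1.0 - tallest                      # mass outside the tallest column
    if coverage >= min_coverage and residual <= max_residual:
        return "struct"
    # Reaching this point with high coverage implies high residual mass.
    if coverage >= min_coverage and width >= min_width:
        return "array"
    return "union"

print(classify({1: 1.0}))                       # struct
print(classify({1: 0.3, 2: 0.3, 5: 0.4}))       # array
print(classify({0: 1/3, 1: 1/3, 2: 1/3}))       # union
```

The system applies this kind of test to groups of tokens with similar histograms; classifying one histogram at a time is a simplification made here.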
9 Format Refinement
- Reanalyze example data with aid of rough description
- Rewrite format description to
  - simplify presentation
    - merge and rewrite structures
  - improve precision
    - reorganize description structure
    - add constraints (sortedness, uniqueness, linear relations, functional dependencies)
  - fill in missing details
    - find completions where structure discovery bottoms out
    - refine base types (termination conditions for strings, integer sizes)
10 Format Refinement
- Three main sub-phases
- Phase 1: Tagging/table generation
  - Convert rough description into tagged description and relational table
- Phase 2: Constraint inference
  - Analyze table and infer constraints
  - Find functional dependencies
- Phase 3: Format rewriting
  - Use inferred constraints and type isomorphisms to rewrite rough description
  - Greedy search to optimize information-theoretic score
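Phases 1 and 2 can be illustrated with a small Python sketch for two-field records whose fields are each either an int or a string. The column names (`tag1`, `int1`, `str1`, and so on), the tag values (1 for the int branch, 2 for the string branch), and all function names are hypothetical; the real system infers a wider range of constraints.

```python
def build_table(records):
    """Phase 1 (sketch): one row per record, None where a branch is absent."""
    rows = []
    for first, second in records:
        row = {}
        for i, field in enumerate((first, second), start=1):
            is_int = field.isdigit()
            row[f"tag{i}"] = 1 if is_int else 2  # which union branch was taken
            row[f"int{i}"] = int(field) if is_int else None
            row[f"str{i}"] = None if is_int else field
        rows.append(row)
    return rows

def constant_columns(rows):
    """Phase 2 (sketch): columns that hold one value whenever present."""
    constants = {}
    for col in rows[0]:
        values = {r[col] for r in rows if r[col] is not None}
        if len(values) == 1:
            constants[col] = values.pop()
    return constants

def determines(rows, a, b):
    """Phase 2 (sketch): True if column b is a function of column a."""
    seen = {}
    for r in rows:
        if seen.setdefault(r[a], r[b]) != r[b]:
            return False
    return True

records = [("0", "24"), ("foo", "beg"), ("bar", "end"), ("0", "56"),
           ("baz", "middle"), ("0", "12"), ("0", "33")]
rows = build_table(records)
print(constant_columns(rows))            # {'int1': 0}
print(determines(rows, "tag1", "tag2"))  # True
print(determines(rows, "tag2", "tag1"))  # True
```

On this data the first int is always 0 and the two union tags determine each other, which is the kind of evidence Phase 3 can use to rewrite a struct of two unions into a union of two structs.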
11-16 Refinement: Simple Example
Example data:
0, 24
foo, beg
bar, end
0, 56
baz, middle
0, 12
0, 33
[Figure, built up over several slides]
- Structure discovery: struct( union(int, alpha) , "," , union(int, alpha) )
- Tagging/table generation: the two unions are tagged id1 and id2, and their int and alpha branches id3 through id6
id1  id2  id3  id4  id5  id6
1    1    0    --   --   24
2    2    --   foo  beg  --
...  ...  ...  ...  ...  ...
- Constraint inference: id3 = 0, id1 = id2 (first union is int whenever second union is int)
- Rule-based structure rewriting: union( struct(0 , "," , int) , struct(str , "," , str) )
- More accurate: the first int is always 0, and int , alpha-string records are ruled out
17 Incomprehensible Type Theory Section of The Talk
18 Evaluation
19 Benchmark Formats
20 Execution Times
21 Training Time
22 Training Accuracy
23 Type Complexity and Min. Training Size
24 Biggest Weakness
- Degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.
- Good tokens capture high-level, human abstractions compactly.
- Techniques for learning tokenizations from data directly?
- Techniques for using multiple, ambiguous tokenization schemes simultaneously?
- Qian Xi is looking at these problems with Kenny I.
25 Related Work
- Most common domains for grammar inference
  - XML/HTML
  - natural language
- Systems that focus on ad hoc data are rare, and those that do don't support the PADS tool suite
  - Rufus system '93, TSIMMIS '94, Potter's Wheel '01
- Top-down structure discovery
  - Arasu and Garcia-Molina '03 (extracting data from web pages)
- Grammar induction using MDL and grammar rewriting search
  - Stolcke and Omohundro '94, "Inducing probabilistic grammars..."
  - T. W. Hong '02, Ph.D. thesis on information extraction from web pages
  - Higuera '01, "Current trends in grammar induction"
26 Conclusions
- Still a work in progress, but we are able to produce XML and statistical reports fully automatically from ad hoc data sources.
- We've tested on approximately 15 real, mostly systemsy data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is relatively good success
- For papers, online demos, and PADS software, see our website at http://www.padsproj.org/
27 End
28 Normalized Information-theoretic Scores