Title: From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data
1. From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data
- Kenny Zhu
- Princeton University
with Kathleen Fisher, David Walker and Peter White
2. A System Admin's Life
3. Web Server Logs
4. System Logs
5. Application Configs
6. User Emails
7. Script Outputs, and More
- Not well documented
- Poorly understood
- No ready-made tools / libraries
- Change from time to time!
- Ad-hoc Data!
8. Automatically Generate Tools from Data!
- XML converter
- Data profiler
- Grapher, etc.
CAGI 2007, POPL 2008, SIGMOD 2008
9. Architecture

[Architecture diagram: Raw Data feeds the LearnPADS format-inference engine, a pipeline of Tokenization, Structure Discovery, and Format Refinement (guided by a Scoring Function). The inferred Data Description goes to the PADS Compiler, which generates the end tools: XML converter, Profiler, etc.]
10. Simple PADS Description

Description:

    Punion payload {
      Pint32 i;
      PstringFW(3) s;
    };
    Pstruct source {
      payload p1;
      ", ";
      payload p2;
    };

Example data:

    0, 24
    bar, end
    foo, 16

- Key points to know:
- Descriptions are based on programming-language types
- Broad collection of base types (ints, strings, dates, IP addresses, ...)
- Structured types include structs, unions and arrays
- Has many other features: dependency, constraints, recursion, ...
- But can be complicated and error-prone for humans to write (a minimal hand-written equivalent is sketched below)
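
To make the notation concrete, here is a minimal Python sketch (illustrative only; PADS generates far richer parsers than this) of the parser the description above denotes: each line is a pair of payloads separated by a comma, and a payload is either a 32-bit integer or a 3-character string.

    # Hand-rolled equivalent of the payload/source description above;
    # names and error handling are simplified for illustration.
    def parse_payload(field):
        # A payload is either a Pint32 or a PstringFW(3).
        s = field.strip()
        try:
            return ("Pint32", int(s))
        except ValueError:
            assert len(s) == 3, "PstringFW(3) expects exactly 3 characters"
            return ("PstringFW(3)", s)

    def parse_source(line):
        # A source record is: payload ',' payload.
        left, right = line.split(",", 1)
        return (parse_payload(left), parse_payload(right))

    for line in ["0, 24", "bar, end", "foo, 16"]:
        print(parse_source(line))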
11. Demo
12. Tokenization

- Parse strings and convert them to symbolic tokens
- Basic token set is skewed towards systems data:
- int, string, date, time, URLs, hostnames
- A config file allows users to define their own new token types via regular expressions:
- Ptypedef Pstring_ME("/[ \\t\\r\\n]+/") PPwhite;
- Ptypedef Pstring_ME("/[0-9]+/") PPint;
- Ptypedef Pstring_ME("/[A-Za-z][A-Za-z0-9_\\-]*/") PPstring;

Tokenize:

    0, 24     ->  INT , INT
    bar, end  ->  STR , STR
    foo, 16   ->  STR , INT
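
A minimal sketch of this kind of tokenizer in Python; the token set and regular expressions below are simplified stand-ins for the real base-type definitions:

    import re

    # Simplified stand-ins for PADS base-type regexes. Order matters:
    # earlier patterns win when several match at the same position.
    TOKEN_DEFS = [
        ("INT", re.compile(r"-?\d+")),
        ("STR", re.compile(r"[A-Za-z][A-Za-z0-9_\-]*")),
        ("PUNCT", re.compile(r"[^\sA-Za-z0-9]")),
    ]

    def tokenize(line):
        tokens, i = [], 0
        while i < len(line):
            if line[i].isspace():          # whitespace separates tokens
                i += 1
                continue
            for name, pat in TOKEN_DEFS:
                m = pat.match(line, i)
                if m:
                    tokens.append(name)
                    i = m.end()
                    break
        return tokens

    for line in ["0, 24", "bar, end", "foo, 16"]:
        print(line, "->", tokenize(line))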
13. Structure Discovery Overview

- Top-down, divide-and-conquer algorithm:
- Compute various statistics from tokenized data
- Guess a top-level description
- Partition tokenized data into smaller chunks
- Recursively analyze and compute descriptions from the smaller chunks
14. Structure Discovery Overview (continued)

[Diagram: the discover step splits the three sources "INT , INT", "STR , STR", "STR , INT" on the comma, giving the candidate structure so far: a struct of an unknown component, a comma, and another unknown component.]
15. Structure Discovery Overview (continued)

[Diagram: recursing on the two unknown components refines the candidate to a struct of a union (INT or STR), a comma, and a second union (INT or STR).]
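
The recursion in slides 13-15 can be sketched in a few lines of Python. This toy version uses only one signal (a token occurring exactly once in every chunk indicates a struct) and punts on arrays; the real algorithm uses the histogram machinery on the next slides:

    from collections import Counter

    def discover(chunks):
        # chunks: list of token sequences, e.g. [["INT", ",", "INT"], ...]
        if all(len(c) == 1 for c in chunks):
            kinds = {c[0] for c in chunks}
            # One token kind everywhere: a base type; several: a union.
            return kinds.pop() if len(kinds) == 1 else ("union", sorted(kinds))
        counts = [Counter(c) for c in chunks]
        for tok in counts[0]:
            if all(cnt[tok] == 1 for cnt in counts):   # struct signal
                lefts = [c[:c.index(tok)] for c in chunks]
                rights = [c[c.index(tok) + 1:] for c in chunks]
                return ("struct", discover(lefts), tok, discover(rights))
        return ("union", sorted({tuple(c) for c in chunks}))

    data = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
    print(discover(data))
    # ('struct', ('union', ['INT', 'STR']), ',', ('union', ['INT', 'STR']))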
16. Structure Discovery Details

- Compute a frequency-distribution histogram for each token
- (And recompute at every level of recursion)

[Histogram plots for the example tokens: x-axis is the number of occurrences per source, y-axis is the percentage of sources.]
17. Structure Discovery Details

- Cluster tokens with similar histograms into groups
- Similar histograms: tokens with strong regularity that coexist in the same description component
- Use symmetric relative entropy to measure similarity
- Only the shape of the histogram matters: normalize histograms by sorting columns in descending size
- Result: comma and quote end up in one group; int and string in another
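
A sketch of the similarity measure, assuming histograms are represented as column lists and the standard symmetrized Kullback-Leibler divergence is used (the paper's exact formulation may differ in details such as smoothing):

    import math

    def normalize(hist):
        # Sort columns in descending size and scale to a distribution,
        # so that only the shape of the histogram matters.
        cols = sorted(hist, reverse=True)
        total = sum(cols)
        return [c / total for c in cols]

    def sym_rel_entropy(h1, h2, eps=1e-9):
        # KL(p||q) + KL(q||p), padding with zeros and smoothing with eps
        # so both distributions share the same support.
        n = max(len(h1), len(h2))
        p = [x + eps for x in h1 + [0.0] * (n - len(h1))]
        q = [x + eps for x in h2 + [0.0] * (n - len(h2))]
        kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
        return kl(p, q) + kl(q, p)

    comma = normalize([3])      # ',' occurs exactly once in all 3 sources
    intstr = normalize([2, 1])  # a token with a less regular histogram
    print(sym_rel_entropy(comma, comma))   # 0.0: identical shapes
    print(sym_rel_entropy(comma, intstr))  # > 0: different shapes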
18. Structure Discovery Details

- Classify the groups (rules sketched below):
- Structs: groups with high coverage and low residual mass
- Arrays: groups with high coverage, sufficient width and high residual mass
- Unions: all other token groups
- Pick the group with the strongest signal to divide and conquer
- More mathematical details are in the POPL 2008 paper
- The struct involving comma and quote is identified in the histogram above
- The overall procedure gives a good starting point for refinement
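
A toy rendering of these classification rules in Python; the thresholds and the exact definitions of coverage and residual mass are placeholders (the real criteria are in the POPL 2008 paper):

    def classify_group(coverage, width, residual_mass,
                       cov_min=0.9, res_max=0.1, width_min=3):
        # coverage: fraction of sources in which the group's tokens appear
        # residual_mass: histogram mass outside the dominant column
        # width: number of distinct tokens in the group
        if coverage >= cov_min and residual_mass <= res_max:
            return "struct"
        if coverage >= cov_min and width >= width_min:
            return "array"   # high coverage, wide, high residual mass
        return "union"

    print(classify_group(coverage=1.0, width=2, residual_mass=0.0))   # struct
    print(classify_group(coverage=0.95, width=5, residual_mass=0.4))  # array
    print(classify_group(coverage=0.5, width=2, residual_mass=0.3))   # union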
19. Format Refinement

- Reanalyze the source data with the aid of the rough description, obtaining functional dependencies and constraints
- Rewrite the format description to:
- simplify presentation
- merge and rewrite structures
- improve precision
- add constraints (uniqueness, ranges, functional dependencies)
- fill in missing details
- find completions where structure discovery bottoms out
- refine base types (integer sizes, array sizes, separators and terminators)
20. Format Refinement

- Three main sub-phases:
- Phase 1: Tagging / table generation. Convert the rough description into a tagged description and a relational table.
- Phase 2: Constraint inference. Analyze the table to infer constraints and find functional dependencies (a sketch follows this list).
- Phase 3: Format rewriting. Use inferred constraints and type isomorphisms to rewrite the rough description; a greedy search optimizes an information-theoretic score.
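
A minimal sketch of the Phase 2 functional-dependency check, run over the generated relational table (brute force over single-column pairs; the actual system is more selective about which dependencies to test):

    def holds_fd(rows, lhs, rhs):
        # lhs functionally determines rhs iff equal lhs values
        # always appear with equal rhs values.
        seen = {}
        for row in rows:
            if row[lhs] in seen and seen[row[lhs]] != row[rhs]:
                return False
            seen[row[lhs]] = row[rhs]
        return True

    def infer_fds(rows, ncols):
        # All single-column functional dependencies lhs -> rhs.
        return [(a, b)
                for a in range(ncols) for b in range(ncols)
                if a != b and holds_fd(rows, a, b)]

    # Hypothetical tagged table: (union branch tag, first field's value)
    table = [(0, "0"), (1, "foo"), (1, "bar"), (0, "0"), (1, "baz")]
    print(infer_fds(table, 2))   # [(1, 0)]: the value determines the branch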
21. Information-Theoretic Scoring Function

- Finding a function to evaluate the goodness of a description involves balancing two ideas:
- A description must be concise: people cannot read and understand enormous descriptions.
- A description must be precise: imprecise descriptions do not give us much useful information.
- Note the trade-off:
- Increasing precision (good) usually increases description size (bad). A Punion of all records?
- Decreasing description size (good) usually decreases precision (bad). A Pstring?
- Minimum Description Length (MDL) principle, with normalized information-theoretic scores:

    TransmissionBits(T, D) = BitsForDescription(T) + BitsForData(D | T)
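
As a toy illustration of the trade-off, here is a back-of-the-envelope Python comparison of the two degenerate descriptions named above (the bit-counting is deliberately crude; the real scoring function is normalized and far more careful):

    import math

    def pstring_cost(records):
        # 'Pstring': near-zero description bits, but the data bits must
        # spell out every character (~8 bits each).
        return 0 + 8 * sum(len(r) + 1 for r in records)

    def punion_cost(records):
        # 'Punion of all records': the description contains every distinct
        # record, so each data line costs only a branch index.
        distinct = set(records)
        desc = 8 * sum(len(r) + 1 for r in distinct)
        data = len(records) * math.log2(max(len(distinct), 2))
        return desc + data

    records = ["0, 24", "foo, beg", "bar, end", "0, 56", "0, 12", "0, 33"]
    print("Pstring:", pstring_cost(records), "bits")            # precise? no
    print("Punion :", round(punion_cost(records), 1), "bits")   # concise? no

A balanced description, such as the struct-of-unions found by structure discovery, aims to beat both extremes under this score.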
22. Refinement: Simple Example

23. Example data:

    0, 24
    foo, beg
    bar, end
    0, 56
    baz, middle
    0, 12
    0, 33
24. Evaluation
25. Benchmark Formats

Available at http://www.padsproj.org/
26. Execution Times
27. Training Time vs. Training Size
28. Training Accuracy vs. Training Size
29. Related Work

- Most common domains for grammar induction:
- XML/HTML inference
- natural language processing
- Systems that focus on ad hoc data are rare, and those that do don't support a PADS-style tool suite:
- Rufus system '93, TSIMMIS '94, Potter's Wheel '01
- Top-down structure discovery:
- Arasu and Garcia-Molina '03 (extracting data from web pages)
- Restricted languages:
- k-reversible languages (Angluin '82)
- SOREs and CHAREs (Bex et al. '06)
- Approximate methods using MDL grammar-rewriting search:
- Stolcke and Omohundro '94, inducing probabilistic grammars
- T. W. Hong '02, Ph.D. thesis on information extraction from web pages
- Garofalakis et al. '00, XTRACT for inferring DTDs
- Our contributions:
- An end-to-end problem: automatic generation of tools
- An under-studied domain: ad hoc data
- A new top-down structure discovery algorithm
30. Ongoing Work
31. Learning Tokenization: Questions

- The degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.
- Example 1:
- Is abc.tv/mypath/my script?a1b2 a filename, a filepath, or a URL?
- Or word dot word slash word slash word space word?
- Example 2:
- Is 85.0 a float, or an int dot int?
- Is 85 an int or a float?
- Good tokens capture high-level, human abstractions compactly.
- Current token definitions can be complex and ambiguous.
- Fixed regex-based tokenization is not good enough!
- Given a string of characters, find the best sequence of base tokens.
- Learn a probabilistic tokenizer?
32. Learning Tokenization: Initial Ideas

- The token sequence is a first-order Hidden Markov Model (HMM):
- Hidden states are tokens (one per character)
- Observables are single characters from the input
- Each input character is mapped to a feature vector (sketched below), with features such as:
- is a digit?
- is uppercase?
- is lowercase?
- is punctuation?
- is whitespace? etc.
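
A sketch of the per-character feature map (the feature choice shown is illustrative):

    def char_features(c):
        # Map one input character to a boolean feature vector.
        return (
            c.isdigit(),                           # is a digit?
            c.isupper(),                           # is uppercase?
            c.islower(),                           # is lowercase?
            not c.isalnum() and not c.isspace(),   # is punctuation?
            c.isspace(),                           # is whitespace?
        )

    for c in "a7 ,":
        print(repr(c), char_features(c))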
33. Learning Tokenization: Main Algorithm

- Manually write PADS descriptions for a large corpus of training data, using a predefined set of base types
- Parse the training data into base tokens (labeling)
- Learn the HMM parameters: initial probabilities, transition probabilities and emission probabilities
- Run the HMM on test data to get the most probable single-character state sequence, using the Viterbi algorithm
- Merge consecutive identical tokens to get the final token sequence (the last two steps are sketched below)
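
A compact sketch of the last two steps, with toy HMM parameters (a real emission model would be estimated from the labeled corpus and would condition on the feature vectors above):

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # Most probable per-character state (token) sequence for obs.
        V = [{s: (start_p[s] * emit_p[s](obs[0]), [s]) for s in states}]
        for ch in obs[1:]:
            layer = {}
            for s in states:
                prob, path = max(
                    (V[-1][t][0] * trans_p[t][s] * emit_p[s](ch), V[-1][t][1])
                    for t in states)
                layer[s] = (prob, path + [s])
            V.append(layer)
        return max(V[-1].values())[1]

    def merge(labels):
        # Collapse runs of identical per-character labels into tokens.
        out = [labels[0]]
        for l in labels[1:]:
            if l != out[-1]:
                out.append(l)
        return out

    states = ["INT", "STR", "PUNCT"]
    start_p = {s: 1 / 3 for s in states}
    trans_p = {s: {t: (0.8 if s == t else 0.1) for t in states} for s in states}
    emit_p = {
        "INT":   lambda c: 0.9 if c.isdigit() else 0.05,
        "STR":   lambda c: 0.9 if c.isalpha() else 0.05,
        "PUNCT": lambda c: 0.9 if not c.isalnum() else 0.05,
    }
    labels = viterbi("foo,16", states, start_p, trans_p, emit_p)
    print(labels)         # per-character states
    print(merge(labels))  # ['STR', 'PUNCT', 'INT']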
34. Conclusions

- We are able to produce XML converters and statistical reports fully automatically from ad hoc data sources.
- We've tested on approximately 16 real, mostly systems-oriented data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is good success.
- We are currently studying the problem of probabilistic tokenization.
- For papers, online demos and PADS software, see our website at http://www.padsproj.org/
35. LearnPADS on the Web
36. End
37. Learning Tokenization: Challenges

- Training data:
- Where and how do we get more training data?
- Is there a way to simplify the laborious manual labeling process?
- How do we adjust the influence weights of different data sources?
- Should we categorize the training data sources?
- Features:
- What is the best set of features to use?
- Fewer features cluster the data better and are easy to compute.
- More features distinguish some ambiguous tokens better, but require larger training data sets and may result in no significant distribution.
- Another level of HMM to learn which features to use?
- The model:
- Is a single-character HMM good enough? It allows only up to 256 features.
- Multi-character states give rise to more features.
- Maybe a higher-order HMM, an HHMM, a gHMM or a CRF?