Title: LearnPADS: Inferring Formats from Ad Hoc Data
1. LearnPADS: Inferring Formats from Ad Hoc Data
- Kathleen Fisher, AT&T Labs Research
- David Walker and Kenny Zhu, Princeton University
2. Ad Hoc Data
- Web server log
  - 128.8.21.79 - - [15/Oct/1997:20:37:04 -0700] "GET /amnesty/images/rally4.jpg HTTP/1.0" 200 7352
  - polux.entelchile.net - - [15/Oct/1997:21:02:07 -0700] "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
  - polux.entelchile.net - - [15/Oct/1997:21:02:29 -0700] "GET /images/blkcnd2.gif HTTP/1.0" 200 1082
  - ws186.library.msstate.edu - - [15/Oct/1997:21:02:44 -0700] "GET /amnesty/usa/assault.html HTTP/1.0" 200 4292
- AT&T phone provisioning data
  - 01000295291201000295291171001649600191001649600271001649600291001649600IA02881001744
  - 00IE02881001714400EDTF_CRTE1001908800EDTF_OS_11001995201161021309814261054589982
  - 9152271915227110000no_ii152271EDTF_10SC1MF1FUNOEDTF_CRTE1001649600EDTF_OS_10100164
- NBA score sheet
  - RNK NAME GP MPG PTS FGM-FGA FG% 3PM-3PA 3P% FTM-FTA FT%
  - 1 Kobe Bryant, LAL 77 40.8 31.6 10.6-22.8 .463 1.8-5.2 .344 8.7-10.0 .868
  - 2 C. Anthony, DEN 65 38.2 28.9 10.6-22.4 .476 0.6-2.3 .268 7.1-8.7 .808
- Coral daemon log
  - 1170306175.105858 "type":"node", "nid":"f9f0d5514ab26c84535acbf81e9d2488a038349a--216.165.109.818089", "time":1170306175, "srvrpc":5000734, "getsrvrpc":1453263, "putsrvrpc":3365571
  - 1170306175.105982 "type":"cluster", "cid":"0000000000000000000000000000000000000000", "level":0, "lsize":18, "size":214, "rtt":320058, "ctime":0
3. Automatically Generate Tools from Data!
- XML converter
- Data profiler
- Visualization tool
- More
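4. Simple End-to-End
- Raw data
  "0, 24"
  "bar, end"
  "foo, 16"
- Description
  Punion payload {
    Pint32 i;
    PstringFW(3) s2;
  };
  Pstruct source {
    '\"'; payload p1;
    ','; payload p2;
    '\"';
  };
- XML output
  [Figure: XML rendering of the parsed records; surviving values: 0, 24, bar, end]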
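To make the end-to-end idea concrete, here is a minimal Python sketch that emulates by hand what a generated XML converter might do for this example. It is not output of the PADS tools. It assumes that each record is wrapped in double quotes (slide 9 mentions a struct involving comma and quote), that blanks around a payload can be ignored, and that the XML element names simply reuse the field names `source`, `p1`, `p2`, `i`, and `s2` from the description.

```python
import re
import xml.etree.ElementTree as ET


def parse_payload(text):
    """Emulate the payload union: try the integer branch, then the 3-character string."""
    if re.fullmatch(r"-?[0-9]+", text):
        return "i", text
    if len(text) == 3:
        return "s2", text
    raise ValueError(f"no union branch matches {text!r}")


def parse_source(record):
    """Emulate the source struct: quote, payload, comma, payload, quote."""
    if len(record) < 2 or record[0] != '"' or record[-1] != '"':
        raise ValueError(f"record is not quoted: {record!r}")
    left, right = record[1:-1].split(",", 1)
    source = ET.Element("source")
    for field, text in (("p1", left), ("p2", right)):
        # Assumption: blanks around a payload are not significant.
        branch, value = parse_payload(text.strip())
        ET.SubElement(ET.SubElement(source, field), branch).text = value
    return source


records = ['"0, 24"', '"bar, end"', '"foo, 16"']
xml_lines = [ET.tostring(parse_source(r), encoding="unicode") for r in records]
for line in xml_lines:
    print(line)
```

The union is tried branch by branch in declaration order, which is why `0` is reported under `i` rather than as a string.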
5. Architecture
- Raw Data
- Format Inference
  - Tokenization
  - Structure Discovery
  - Format Refinement
  - Scoring Function
- Data Description
6. Tokenization
- Parse strings and convert them to symbolic tokens
- Basic token set skewed towards systems data
  - Int, string, date, time, URLs, hostnames
- A config file allows users to define their own new token types via regular expressions
- Example (tokenize)
  "0, 24"    →  " INT , INT "
  "bar, end" →  " STR , STR "
  "foo, 16"  →  " STR , INT "
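The tokenization step can be sketched in a few lines of Python. This is an illustration only, not the LearnPADS tokenizer: the two base patterns, the `build_scanner` and `tokenize` helpers, and the way user-defined tokens are passed in (a list of name/regex pairs standing in for the config file) are all assumptions. Records are assumed to be wrapped in double quotes.

```python
import re

# Assumed base token set; the real one also covers dates, times, URLs, hostnames.
BASE_TOKENS = [
    ("INT", r"-?[0-9]+"),
    ("STR", r"[A-Za-z]+"),
]


def build_scanner(extra_tokens=()):
    """Compile one regex; user-defined tokens come first so they take priority."""
    specs = list(extra_tokens) + BASE_TOKENS
    alternatives = "|".join(f"(?P<{name}>{pattern})" for name, pattern in specs)
    # Whitespace is skipped; any other single character is kept as punctuation.
    return re.compile(alternatives + r"|(?P<WS>\s+)|(?P<PUNCT>.)")


def tokenize(line, scanner):
    """Return the token sequence for one source line."""
    tokens = []
    for match in scanner.finditer(line):
        kind = match.lastgroup
        if kind == "WS":
            continue
        tokens.append(match.group() if kind == "PUNCT" else kind)
    return tokens


default_scanner = build_scanner()
print(tokenize('"0, 24"', default_scanner))

# A hypothetical user-defined token type, as a config file entry might provide.
ip_scanner = build_scanner([("IP", r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")])
print(tokenize("128.8.21.79 - -", ip_scanner))
```

Placing user-defined patterns ahead of the base set lets a more specific token such as `IP` win over `INT` on the same text.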
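7. Structure Discovery Overview
[Figure: one step of structure discovery]
- Sources: " INT , INT ", " STR , STR ", " STR , INT "
- discover: the candidate structure so far is a struct of the form ? , ?
- New sources for the first ?: INT, STR, STR
- New sources for the second ?: INT, STR, INT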
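The divide step on this slide can be sketched as follows: once a set of struct tokens has been chosen, every source is cut at those tokens, and the pieces in the same position form a new, smaller set of sources to analyze recursively. The helper name `split_on_struct_tokens` and the requirement that all sources yield the same number of pieces are assumptions of this sketch, as is the presence of quote tokens around each record.

```python
def split_on_struct_tokens(sources, struct_tokens):
    """Cut each token sequence at the struct tokens; group the pieces by position."""
    columns = None
    for source in sources:
        pieces, current = [], []
        for token in source:
            if token in struct_tokens:
                pieces.append(current)
                current = []
            else:
                current.append(token)
        pieces.append(current)
        if columns is None:
            columns = [[] for _ in pieces]
        if len(pieces) != len(columns):
            raise ValueError("struct tokens do not occur equally often in every source")
        for column, piece in zip(columns, pieces):
            column.append(piece)
    return columns


sources = [
    ['"', "INT", ",", "INT", '"'],
    ['"', "STR", ",", "STR", '"'],
    ['"', "STR", ",", "INT", '"'],
]
columns = split_on_struct_tokens(sources, {'"', ","})
for column in columns:
    print(column)
```

The two non-empty columns are the new sources for the two unknown fields; the empty first and last columns correspond to the positions before the opening quote and after the closing quote.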
8. Structure Discovery Details
- Compute a frequency distribution histogram for each token.
  - (And recompute at every level of recursion.)
[Figure: per-token histograms for the sources " INT , INT ", " STR , STR ", " STR , INT "; x-axis: number of occurrences per source; y-axis: percentage of sources]
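A per-token histogram of this kind is straightforward to compute. The sketch below assumes each source has already been tokenized into a list of token names and that records carry surrounding quote tokens; `token_histograms` is a hypothetical helper name.

```python
from collections import Counter


def token_histograms(sources):
    """Map each token to {occurrences per source: fraction of sources}."""
    vocabulary = {token for source in sources for token in source}
    histograms = {}
    for token in sorted(vocabulary):
        counts = Counter(source.count(token) for source in sources)
        histograms[token] = {
            n: hits / len(sources) for n, hits in sorted(counts.items())
        }
    return histograms


sources = [
    ['"', "INT", ",", "INT", '"'],
    ['"', "STR", ",", "STR", '"'],
    ['"', "STR", ",", "INT", '"'],
]
hist = token_histograms(sources)
for token, distribution in hist.items():
    print(token, distribution)
```

In this toy input the comma occurs exactly once and the quote exactly twice in every source, so each has a single full-height column, whereas `INT` and `STR` are spread evenly over zero, one, and two occurrences.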
9. Structure Discovery Details
- Cluster tokens with similar histograms into groups
- Classify the groups into
  - Structs: groups with high coverage and low residual mass
  - Arrays: groups with high coverage, sufficient width, and high residual mass
  - Unions: other token groups
- Pick the group with the strongest signal to divide and conquer
  - Struct involving comma, quote identified in histogram above
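The classification rules above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it classifies a single token rather than a cluster of tokens, it takes coverage to be the fraction of sources containing the token, width to be the number of distinct non-zero occurrence counts, and residual mass to be the fraction of sources that do not fall in the tallest non-zero histogram column. The thresholds are hypothetical defaults, not values from the slides.

```python
from collections import Counter


def classify_token(counts_per_source, min_coverage=0.9, max_residual=0.1, min_width=2):
    """Classify a token from its occurrence count in each source."""
    total = len(counts_per_source)
    nonzero = Counter(n for n in counts_per_source if n > 0)
    if not nonzero:
        return "union"
    coverage = sum(nonzero.values()) / total      # sources containing the token
    width = len(nonzero)                          # distinct non-zero counts
    residual = 1 - max(nonzero.values()) / total  # mass outside the tallest column
    if coverage >= min_coverage and residual <= max_residual:
        return "struct"
    if coverage >= min_coverage and width >= min_width:
        return "array"
    return "union"


print(classify_token([1, 1, 1]))      # a token like the comma: once per source
print(classify_token([3, 5, 2, 7]))   # present everywhere, varying repetition
print(classify_token([2, 0, 1]))      # a token like INT: missing from some sources
```

A token that appears the same number of times in nearly every source is a good struct separator; one that appears everywhere but a varying number of times suggests an array; everything else is left for a union.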
10. Format Refinement
- Reanalyze the source data with the aid of the rough description and obtain functional dependencies and constraints
- Rewrite the format description to
  - simplify presentation
    - merge and rewrite structures
  - improve precision
    - add constraints (uniqueness, ranges, functional dependencies)
  - fill in missing details
    - find completions where structure discovery bottoms out
    - refine base types (integer sizes, array sizes, separators and terminators)
- Rewriting is guided by local search that optimizes an information-theoretic score
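As one small illustration of the refinement ideas above, the sketch below takes the values observed for an integer field and derives a narrower base type plus range and uniqueness constraints. It is not the LearnPADS refinement algorithm: the helper `refine_int_field`, the choice of the smallest signed width that fits, and the shape of the returned dictionary are assumptions. The type names follow the `Pint32` naming seen on slide 4.

```python
def refine_int_field(values):
    """Infer a signed integer width and simple constraints from observed values."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            break
    else:
        raise ValueError("values do not fit in 64 bits")
    return {
        "base_type": f"Pint{bits}",
        "range": (lo, hi),
        "unique": len(set(values)) == len(values),
        "constant": lo == hi,
    }


small = refine_int_field([24, 16, 56, 12, 33])
large = refine_int_field([0, 70000, 0])
print(small)
print(large)
```

Constraints inferred this way only describe the sample at hand, so a real tool would have to weigh the added precision against the risk of rejecting future data.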
11. Scoring Function
- Finding a function to evaluate the goodness of a description involves balancing two ideas
  - a description must be concise
    - people cannot read and understand enormous descriptions
  - a description must be precise
    - imprecise descriptions do not give us much useful information
- Note the trade-off
  - increasing precision (good) usually increases description size (bad)
  - decreasing description size (good) usually decreases precision (bad)
- Minimum Description Length (MDL) Principle
  - Normalized information-theoretic scores
  Transmission Bits = BitsForDescription(T) + BitsForData(D given T)
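The two-part score can be made concrete with a toy calculation. The cost model below is an assumption chosen for simplicity and is not the LearnPADS scoring function: the description costs 8 bits per character, and each data character costs log2(alphabet size + 1) bits, where the extra symbol marks the end of a record. The description strings are only labels for a precise (digits only) and a loose (any printable character) candidate.

```python
import math


def bits_for_description(description):
    """Toy cost of transmitting the description: 8 bits per character."""
    return 8 * len(description)


def bits_for_data(records, alphabet_size):
    """Toy cost of transmitting the data given the description.

    Each character, plus one end-of-record marker per record, is drawn from
    the alphabet the description allows.
    """
    bits_per_symbol = math.log2(alphabet_size + 1)
    return sum((len(record) + 1) * bits_per_symbol for record in records)


def transmission_bits(description, records, alphabet_size):
    """Transmission Bits = BitsForDescription(T) + BitsForData(D given T)."""
    return bits_for_description(description) + bits_for_data(records, alphabet_size)


records = ["24", "16", "56", "12", "33"]
precise = transmission_bits("Pint32 x", records, alphabet_size=10)   # digits only
loose = transmission_bits("Pstring x", records, alphabet_size=95)    # printable ASCII
print(round(precise, 1), round(loose, 1))
```

Under this model the digits-only description wins because each data character is cheaper to encode when the description rules out more possibilities, which is the precision side of the trade-off described above.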
12. LearnPADS On the Web
http://www.padsproj.org
13. End
14. Related Work
- Most common domains for grammar inference
  - XML/HTML
  - natural language
- Systems that focus on ad hoc data are rare, and those that exist don't support the PADS tool suite
  - Rufus system '93, TSIMMIS '94, Potter's Wheel '01
- Top-down structure discovery
  - Arasu & Garcia-Molina '03 (extracting data from web pages)
- Grammar induction using MDL and grammar rewriting search
  - Stolcke and Omohundro '94, "Inducing probabilistic grammars..."
  - T. W. Hong '02, Ph.D. thesis on information extraction from web pages
  - Higuera '01, "Current trends in grammar induction"
  - Garofalakis et al. '00, "XTRACT" for inferring DTDs
15. [sample records]
"0, 24"
"foo, beg"
"bar, end"
"0, 56"
"baz, middle"
"0, 12"
"0, 33"