LearnPADS: Inferring Formats from Ad Hoc Data


1
LearnPADS: Inferring Formats from Ad Hoc Data

Kathleen Fisher
AT&T Labs Research
David Walker and Kenny Zhu
Princeton University

2
Ad Hoc Data

Web server log
128.8.21.79 - - [15/Oct/1997:20:37:04 -0700] "GET /amnesty/images/rally4.jpg HTTP/1.0" 200 7352
polux.entelchile.net - - [15/Oct/1997:21:02:07 -0700] "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
polux.entelchile.net - - [15/Oct/1997:21:02:29 -0700] "GET /images/blkcnd2.gif HTTP/1.0" 200 1082
ws186.library.msstate.edu - - [15/Oct/1997:21:02:44 -0700] "GET /amnesty/usa/assault.html HTTP/1.0" 200 4292
AT&T phone provisioning data
0100029529120100029529117100164960019100164
9600271001649600291001649600IA02881001744
00IE02881001714400EDTF_CRTE1001908800EDTF_OS_
11001995201161021309814261054589982
9152271915227110000no_ii152271EDTF_10SC
1MF1FUNOEDTF_CRTE1001649600EDTF_OS_10100164
NBA score sheet
RNK NAME GP MPG PTS FGM-FGA FG% 3PM-3PA 3P% FTM-FTA FT%
1 Kobe Bryant, LAL 77 40.8 31.6 10.6-22.8 .463 1.8-5.2 .344 8.7-10.0 .868
2 C. Anthony, DEN 65 38.2 28.9 10.6-22.4 .476 0.6-2.3 .268 7.1-8.7 .808
Coral daemon log
1170306175.105858 "type""node", "nid""f9f0d5514ab26c84535acbf81e9d2488a038349a--216.165.109.818089", "time"1170306175, "srvrpc"5000734, "getsrvrpc"1453263, "putsrvrpc"3365571
1170306175.105982 "type""cluster", "cid""0000000000000000000000000000000000000000", "level"0, "lsize"18, "size"214, "rtt"320058, "ctime"0

3
Automatically Generate Tools from Data!

XML converter
Data profiler
Visualization tool
More

4
Simple End-to-End

Data Sources
0, 24
bar, end
foo, 16

Description
Punion payload { Pint32 i; PstringFW(3) s2; };
Pstruct source { '\"'; payload p1; ','; payload p2; '\"'; };

XML output
[Figure: generated XML for the parsed records; surviving values: 0, 24, bar, end]

5
Architecture
Raw Data -> Format Inference -> Data Description
Format Inference stages: Tokenization -> Structure Discovery -> Format Refinement
Scoring Function: guides Format Refinement
6
Tokenization

Parse strings and convert them to symbolic tokens
Basic token set is skewed towards systems data: int, string, date, time, URLs, hostnames
A config file allows users to define their own new token types via regular expressions

0, 24     tokenize ->  INT , INT
bar, end  tokenize ->  STR , STR
foo, 16   tokenize ->  STR , INT
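
The tokenization step can be sketched with ordinary regular expressions. This is only an illustration, not the LearnPADS implementation: the token names, the patterns, and the `make_tokenizer` helper are assumptions made for this example, and the `user_specs` argument stands in for the user-defined token types that the slide says come from a config file.

```python
import re

# Built-in token classes, tried in order (illustrative subset, not the real set).
BASE_SPECS = [
    ("INT", r"-?\d+"),
    ("STR", r"[A-Za-z]+"),
    ("WS", r"\s+"),
    ("PUNCT", r"."),  # any other single character stands for itself
]


def make_tokenizer(user_specs=()):
    """Build a tokenizer; user-defined (name, regex) pairs take priority."""
    specs = list(user_specs) + BASE_SPECS
    master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in specs))

    def tokenize(line):
        tokens = []
        for match in master.finditer(line):
            kind = match.lastgroup
            if kind == "WS":
                continue  # whitespace is dropped
            # Punctuation is kept literally; everything else becomes its class name.
            tokens.append(match.group() if kind == "PUNCT" else kind)
        return tokens

    return tokenize


tokenize = make_tokenizer()
for line in ["0, 24", "bar, end", "foo, 16"]:
    print(line, "->", " ".join(tokenize(line)))

# A hypothetical user-defined token type for dotted IP addresses.
tokenize_ip = make_tokenizer([("IP", r"\d{1,3}(?:\.\d{1,3}){3}")])
print(tokenize_ip("128.8.21.79 - -"))
```

Putting user-defined patterns first lets a more specific token such as `IP` win over the generic `INT` pattern.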
7
Structure Discovery Overview
sources:
INT , INT
STR , STR
STR , INT

discover ->

candidate structure so far:
struct
?  (sources: INT / STR / STR)
,
?  (sources: INT / STR / INT)
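
The divide-and-conquer step pictured above can be sketched as follows. Once a token (here the comma) has been chosen as a struct separator, each source is split around it and the pieces become the sources for the still-unknown fields. The function name `split_on_struct_token` is hypothetical, and the sketch assumes the separator occurs exactly once per source.

```python
def split_on_struct_token(sources, sep):
    """Split each tokenized source at `sep`; return the sources for each field."""
    left, right = [], []
    for src in sources:
        i = src.index(sep)        # assumes `sep` occurs exactly once
        left.append(src[:i])      # tokens before the separator
        right.append(src[i + 1:])  # tokens after the separator
    return left, right


sources = [
    ["INT", ",", "INT"],
    ["STR", ",", "STR"],
    ["STR", ",", "INT"],
]
first_field, second_field = split_on_struct_token(sources, ",")
print(first_field)
print(second_field)
```

Structure discovery would then recurse on each field's sources separately.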
8
Structure Discovery Details

Compute a frequency-distribution histogram for each token (and recompute at every level of recursion).

Sources:
INT , INT
STR , STR
STR , INT
[Histogram: x-axis, number of occurrences per source; y-axis, percentage of sources]
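
A minimal sketch of the histogram computation for the three sources above. The representation (a dict from occurrence count to fraction of sources) and the name `token_histograms` are assumptions made for this example, not the tool's actual data structures.

```python
from collections import Counter


def token_histograms(sources):
    """For each token, map 'occurrences per source' to the fraction of sources."""
    tokens = {tok for src in sources for tok in src}
    hists = {}
    for tok in tokens:
        # How many sources contain the token 0 times, 1 time, 2 times, ...
        counts = Counter(src.count(tok) for src in sources)
        hists[tok] = {k: v / len(sources) for k, v in sorted(counts.items())}
    return hists


sources = [
    ["INT", ",", "INT"],
    ["STR", ",", "STR"],
    ["STR", ",", "INT"],
]
hists = token_histograms(sources)
for tok in sorted(hists):
    print(tok, hists[tok])
```

The comma appears exactly once in every source, so its histogram is a single bar. INT and STR are spread over zero, one, and two occurrences.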
9
Structure Discovery Details

Cluster tokens with similar histograms into groups
Classify the groups into:
Structs: groups with high coverage and low residual mass
Arrays: groups with high coverage, sufficient width, and high residual mass
Unions: other token groups
Pick the group with the strongest signal to divide and conquer
Struct involving comma and quote identified in the histogram above
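
The classification rules above can be sketched for a single token histogram. Everything specific here is an assumption for illustration: the slides do not define coverage, residual mass, or width, nor give thresholds. The sketch takes coverage to be the fraction of sources containing the token, residual mass to be the fraction of sources whose count differs from the most common nonzero count, and width to be the number of distinct nonzero counts. The real system classifies groups of tokens rather than individual ones.

```python
# Hypothetical thresholds, chosen only for this example.
MIN_COVERAGE = 0.9
MAX_RESIDUAL_MASS = 0.1
MIN_WIDTH = 3


def coverage(hist):
    """Fraction of sources in which the token occurs at least once."""
    return sum(frac for count, frac in hist.items() if count > 0)


def residual_mass(hist):
    """Fraction of sources outside the most common nonzero count."""
    nonzero = [frac for count, frac in hist.items() if count > 0]
    return 1.0 - max(nonzero) if nonzero else 1.0


def width(hist):
    """Number of distinct nonzero occurrence counts."""
    return sum(1 for count, frac in hist.items() if count > 0 and frac > 0)


def classify(hist):
    if coverage(hist) >= MIN_COVERAGE:
        if residual_mass(hist) <= MAX_RESIDUAL_MASS:
            return "struct"
        if width(hist) >= MIN_WIDTH:
            return "array"
    return "union"


comma_hist = {1: 1.0}                        # once in every source
int_hist = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}    # absent from a third of sources
varying_hist = {3: 0.25, 5: 0.25, 7: 0.25, 9: 0.25}  # always present, count varies

print(classify(comma_hist), classify(int_hist), classify(varying_hist))
```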

10
Format Refinement

Reanalyze source data with the aid of the rough description and obtain functional dependencies and constraints
Rewrite the format description to:
simplify presentation: merge and rewrite structures
improve precision: add constraints (uniqueness, ranges, functional dependencies)
fill in missing details: find completions where structure discovery bottoms out; refine base types (integer sizes, array sizes, separators and terminators)
Rewriting is guided by local search that optimizes an information-theoretic score
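
One of the refinements listed above, refining integer sizes and adding range and uniqueness constraints, can be sketched by re-scanning the values that landed in an integer field. The function `refine_int_column`, the result layout, and the `Pint8`/`Pint16`/... labels are assumptions for this example. The sketch does not show the rewriting search itself.

```python
def refine_int_column(values):
    """Infer a signed integer width plus range and uniqueness constraints."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        # Smallest signed width whose range contains every observed value.
        if -(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1):
            break
    else:
        raise ValueError("values do not fit in 64 bits")
    return {
        "base_type": f"Pint{bits}",  # label only; naming is an assumption
        "range": (lo, hi),
        "unique": len(set(values)) == len(values),
    }


small = refine_int_column([24, 16, 0])
sizes = refine_int_column([200, 7352, 8540, 200])
print(small)
print(sizes)
```

Constraints inferred this way hold only for the sample that was analyzed, so a real tool would have to decide how much evidence justifies adding them.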

11
Scoring Function

Finding a function to evaluate the "goodness" of a description involves balancing two ideas:
A description must be concise: people cannot read and understand enormous descriptions.
A description must be precise: imprecise descriptions do not give us much useful information.
Note the trade-off:
Increasing precision (good) usually increases description size (bad).
Decreasing description size (good) usually decreases precision (bad).
Minimum Description Length (MDL) Principle: Normalized Information-theoretic Scores

Transmission Bits = BitsForDescription(T) + BitsForData(D given T)
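
The two-part score can be made concrete with a toy cost model. This is a sketch under stated assumptions, not the scoring function used by LearnPADS: the tuple-based description language, the per-node costs, and every function name below are invented for illustration, and no normalization is applied.

```python
import math
import re

# Toy description language (hypothetical):
#   ("int", nbits)   signed integer that fits in nbits
#   ("strfw", n)     fixed-width string of n characters
#   ("str",)         variable-length string
#   ("union", [...]) one of several alternatives
NODE_KINDS = 4


def bits_for_description(t):
    """Cost of sending the description itself."""
    cost = math.log2(NODE_KINDS)  # which kind of node this is
    if t[0] in ("int", "strfw"):
        cost += 8                 # assume the parameter fits in one byte
    elif t[0] == "union":
        cost += 8                 # assume the branch count fits in one byte
        cost += sum(bits_for_description(b) for b in t[1])
    return cost


def bits_for_data(t, value):
    """Cost of sending one value given description t (inf if it does not match)."""
    kind = t[0]
    if kind == "int":
        limit = 2 ** (t[1] - 1)
        ok = re.fullmatch(r"-?\d+", value) and -limit <= int(value) < limit
        return t[1] if ok else math.inf
    if kind == "strfw":
        return 8 * t[1] if len(value) == t[1] else math.inf
    if kind == "str":
        return 8 * (len(value) + 1)  # characters plus a terminator
    # union: a tag selecting the branch, then the cheapest matching branch
    return math.log2(len(t[1])) + min(bits_for_data(b, value) for b in t[1])


def transmission_bits(t, data):
    return bits_for_description(t) + sum(bits_for_data(t, v) for v in data)


data = ["0", "24", "bar", "end", "foo", "16"]
loose = ("str",)                                 # concise but imprecise
tight = ("union", [("int", 8), ("strfw", 3)])    # larger but more precise
print(transmission_bits(loose, data), transmission_bits(tight, data))
```

Under this particular cost model the more precise description costs more bits to state but fewer bits overall, which is the trade-off the MDL principle is meant to arbitrate. A different cost model could rank the two candidates differently.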
12
LearnPADS On the Web
http://www.padsproj.org
13
End
14
Related Work

Most common domains for grammar inference: XML/HTML, natural language
Systems that focus on ad hoc data are rare, and those that do don't support the PADS tool suite
Rufus system '93, TSIMMIS '94, Potter's Wheel '01
Top-down structure discovery
Arasu & Garcia-Molina '03 (extracting data from web pages)
Grammar induction using MDL and grammar rewriting search
Stolcke and Omohundro '94: Inducing probabilistic grammars...
T. W. Hong '02: Ph.D. thesis on information extraction from web pages
Higuera '01: Current trends in grammar induction
Garofalakis et al. '00: XTRACT for inferring DTDs