LearnPADS: Inferring Formats from Ad Hoc Data

1
LearnPADS: Inferring Formats from Ad Hoc Data
  • Kathleen Fisher, AT&T Labs Research
  • David Walker and Kenny Zhu, Princeton University

2
Ad Hoc Data
  • Web server log
  • 128.8.21.79 - - [15/Oct/1997:20:37:04 -0700] "GET /amnesty/images/rally4.jpg HTTP/1.0" 200 7352
  • polux.entelchile.net - - [15/Oct/1997:21:02:07 -0700] "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
  • polux.entelchile.net - - [15/Oct/1997:21:02:29 -0700] "GET /images/blkcnd2.gif HTTP/1.0" 200 1082
  • ws186.library.msstate.edu - - [15/Oct/1997:21:02:44 -0700] "GET /amnesty/usa/assault.html HTTP/1.0" 200 4292
  • AT&T phone provisioning data
  • 0100029529120100029529117100164960019100164
    9600271001649600291001649600IA02881001744
  • 00IE02881001714400EDTF_CRTE1001908800EDTF_OS_
    11001995201161021309814261054589982
  • 9152271915227110000no_ii152271EDTF_10SC
    1MF1FUNOEDTF_CRTE1001649600EDTF_OS_10100164
  • NBA score sheet
  • RNK NAME GP MPG PTS FGM-FGA FG% 3PM-3PA 3P% FTM-FTA FT%
  • 1 Kobe Bryant, LAL 77 40.8 31.6 10.6-22.8 .463 1.8-5.2 .344 8.7-10.0 .868
  • 2 C. Anthony, DEN 65 38.2 28.9 10.6-22.4 .476 0.6-2.3 .268 7.1-8.7 .808
  • Coral daemon log
  • 1170306175.105858 "type":"node", "nid":"f9f0d5514ab26c84535acbf81e9d2488a038349a--216.165.109.818089", "time":1170306175, "srvrpc":5000734, "getsrvrpc":1453263, "putsrvrpc":3365571
  • 1170306175.105982 "type":"cluster", "cid":"0000000000000000000000000000000000000000", "level":0, "lsize":18, "size":214, "rtt":320058, "ctime":0

3
Automatically Generate Tools from Data!
  • XML converter
  • Data profiler
  • Visualization tool
  • More

4
Simple End-to-End
  • Data sources:
    0, 24
    bar, end
    foo, 16
  • Description:
    Punion payload { Pint32 i; PstringFW(3) s2; };
    Pstruct source { '\"'; payload p1; ','; payload p2; '\"'; };
  • XML output: [Figure: generated XML for the fields 0, 24, bar, end]

5
Architecture
[Figure: Raw Data → Format Inference (Tokenization → Structure Discovery → Format Refinement, with a Scoring Function) → Data Description]
6
Tokenization
  • Parse strings and convert them to symbolic tokens
  • The basic token set is skewed towards systems data: Int, string, date, time, URLs, hostnames
  • A config file lets users define their own new token types via regular expressions

  tokenize:
    0, 24     →  INT , INT
    bar, end  →  STR , STR
    foo, 16   →  STR , INT
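
The tokenization step can be sketched in a few lines of Python. This is only an illustration, not the LearnPADS tokenizer: the token table `TOKEN_TYPES`, its two regular expressions, and the functions `tokenize` and `token_names` are hypothetical names chosen here, and treating every unmatched character as its own punctuation token is an assumption.

```python
import re

# Hypothetical token table (name, regular expression), tried in order.
# A user-supplied config file could extend a table like this one.
TOKEN_TYPES = [
    ("INT", r"-?\d+"),
    ("STR", r"[A-Za-z][A-Za-z0-9_]*"),
]


def tokenize(line, token_types=TOKEN_TYPES):
    """Return a list of (token_name, lexeme) pairs for one source line."""
    compiled = [(name, re.compile(pattern)) for name, pattern in token_types]
    tokens = []
    pos = 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1  # whitespace is skipped in this sketch
            continue
        for name, regex in compiled:
            match = regex.match(line, pos)
            if match and match.end() > pos:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # Assumption: any other character is a punctuation token
            # named after itself.
            tokens.append((line[pos], line[pos]))
            pos += 1
    return tokens


def token_names(line):
    """Keep only the symbolic token names, dropping the lexemes."""
    return [name for name, _ in tokenize(line)]
```

Calling `token_names("foo, 16")` on the third sample line yields the token names STR, the comma, and INT, matching the slide.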
7
Structure Discovery Overview
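[Figure: the candidate structure so far is a struct split on the comma, with two fields still to be discovered (?). The sources INT , INT / STR , STR / STR , INT are divided into a first column (INT, STR, STR) and a second column (INT, STR, INT).]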
8
Structure Discovery Details
  • Compute a frequency distribution histogram for each token (and recompute it at every level of recursion).

[Figure: per-token histograms for the sources INT , INT / STR , STR / STR , INT. Axes: number of occurrences per source, percentage of sources.]
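
As a rough sketch of the histogram computation, the Python below counts, for each token, how many times it occurs in each source and reports the fraction of sources at each count. The function name `token_histograms` and the dictionary layout are assumptions made for this example.

```python
from collections import Counter, defaultdict


def token_histograms(sources):
    """Map each token to {occurrences_per_source: fraction_of_sources}.

    `sources` is a list of token-name lists, one list per source line.
    """
    total = len(sources)
    all_tokens = {token for source in sources for token in source}
    per_token = defaultdict(Counter)
    for source in sources:
        counts = Counter(source)
        for token in all_tokens:
            # counts[token] is 0 when the token is absent from this source.
            per_token[token][counts[token]] += 1
    return {
        token: {k: n / total for k, n in sorted(hist.items())}
        for token, hist in per_token.items()
    }


sources = [
    ["INT", ",", "INT"],
    ["STR", ",", "STR"],
    ["STR", ",", "INT"],
]
histograms = token_histograms(sources)
```

For the three sample sources, the comma occurs exactly once in every source, while INT and STR each occur zero, one, or two times in one third of the sources.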
9
Structure Discovery Details
  • Cluster tokens with similar histograms into groups
  • Classify the groups into:
      • Structs: groups with high coverage and low residual mass
      • Arrays: groups with high coverage, sufficient width, and high residual mass
      • Unions: other token groups
  • Pick the group with the strongest signal to divide and conquer
  • Example: the struct involving the comma and quote tokens is identified in the histogram above
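
The classification rule can be sketched as follows. The slides do not define coverage, width, or residual mass, so the definitions here are assumptions: coverage is the fraction of sources containing the token at least once, width is the number of distinct nonzero occurrence counts, and residual mass is the share of the covered sources that fall outside the most common nonzero count. The thresholds and the name `classify` are hypothetical, and the clustering of similar histograms is omitted (each token is treated as its own group).

```python
def classify(hist, min_coverage=0.9, max_residual=0.1, min_width=3):
    """Classify one histogram {occurrences: fraction_of_sources}.

    Returns "struct", "array", or "union". All thresholds are
    illustrative values, not ones taken from LearnPADS.
    """
    nonzero = {k: f for k, f in hist.items() if k > 0}
    coverage = sum(nonzero.values())
    if coverage == 0:
        return "union"
    width = len(nonzero)
    residual = 1 - max(nonzero.values()) / coverage
    if coverage >= min_coverage and residual <= max_residual:
        return "struct"
    if coverage >= min_coverage and width >= min_width:
        return "array"
    return "union"


# The comma appears exactly once in every source: a struct signal.
comma_kind = classify({1: 1.0})
# INT appears 0, 1, or 2 times: low coverage, so it falls to union.
int_kind = classify({0: 1 / 3, 1: 1 / 3, 2: 1 / 3})
# A separator repeated a varying number of times: an array signal.
sep_kind = classify({3: 0.25, 5: 0.25, 7: 0.25, 9: 0.25})
```

Under these assumed definitions the comma is the strongest signal in the running example, which is why it is chosen to split the sources first.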

10
Format Refinement
  • Reanalyze the source data with the aid of the rough description to obtain functional dependencies and constraints
  • Rewrite the format description to:
      • simplify presentation: merge and rewrite structures
      • improve precision: add constraints (uniqueness, ranges, functional dependencies)
      • fill in missing details: find completions where structure discovery bottoms out; refine base types (integer sizes, array sizes, separators and terminators)
  • Rewriting is guided by a local search that optimizes an information-theoretic score
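
One kind of refinement listed above, narrowing an integer field and attaching constraints, might look like the sketch below. The function `refine_int_column` and the shape of its result are hypothetical; the sketch only derives an integer size, a range, and a uniqueness flag from the observed values, and does not model the local search.

```python
def refine_int_column(values):
    """Infer the smallest integer width, the range, and uniqueness
    for a non-empty list of observed integer values."""
    lo, hi = min(values), max(values)
    signed = lo < 0
    for bits in (8, 16, 32, 64):
        if signed:
            fits = -(1 << (bits - 1)) <= lo and hi < (1 << (bits - 1))
        else:
            fits = hi < (1 << bits)
        if fits:
            break
    else:
        bits = None  # does not fit in 64 bits
    return {
        "bits": bits,
        "signed": signed,
        "min": lo,
        "max": hi,
        "unique": len(set(values)) == len(values),
    }


refined = refine_int_column([24, 16, 56, 12, 33])
```

For the integer payloads that appear in the sample data (24, 16, 56, 12, 33), this reports an unsigned 8-bit field with range 12 to 56 and no repeated values.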

11
Scoring Function
  • Finding a function to evaluate the goodness of a description involves balancing two ideas:
      • a description must be concise: people cannot read and understand enormous descriptions
      • a description must be precise: imprecise descriptions do not give us much useful information
  • Note the trade-off:
      • increasing precision (good) usually increases description size (bad)
      • decreasing description size (good) usually decreases precision (bad)
  • Minimum Description Length (MDL) Principle
  • Normalized Information-theoretic Scores

Transmission Bits = BitsForDescription(T) + BitsForData(D given T)
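
A toy calculation can make the trade-off concrete. Everything in this sketch is an assumption made for illustration: descriptions are plain strings charged at 8 bits per character, an unconstrained integer costs 32 bits per value, and a range-constrained integer costs just enough bits to index the range. None of this is the actual LearnPADS scoring function.

```python
import math


def range_bits(lo, hi):
    """Bits needed to pick one integer from the inclusive range [lo, hi]."""
    return math.ceil(math.log2(hi - lo + 1))


def transmission_bits(description, bits_per_value, n_values, bits_per_char=8):
    """Toy MDL score: bits for the description plus bits for the data."""
    bits_for_description = bits_per_char * len(description)
    bits_for_data = bits_per_value * n_values
    return bits_for_description + bits_for_data


# Two candidate descriptions of the same column of integers.
loose = "int"
tight = "int in [12, 56]"

# With five values, the more precise description is cheaper overall.
loose_5 = transmission_bits(loose, 32, 5)
tight_5 = transmission_bits(tight, range_bits(12, 56), 5)

# With a single value, the shorter description wins instead.
loose_1 = transmission_bits(loose, 32, 1)
tight_1 = transmission_bits(tight, range_bits(12, 56), 1)
```

The longer description pays for itself only once there is enough data for the per-value savings to outweigh its extra length.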
12
LearnPADS On the Web
http://www.padsproj.org
13
End
14
Related Work
  • Most common domains for grammar inference: XML/HTML and natural language
  • Systems that focus on ad hoc data are rare, and those that exist don't support the PADS tool suite
      • Rufus system '93, TSIMMIS '94, Potter's Wheel '01
  • Top-down structure discovery
      • Arasu & Garcia-Molina '03 (extracting data from web pages)
  • Grammar induction using MDL and grammar rewriting search
      • Stolcke and Omohundro '94: Inducing probabilistic grammars...
      • T. W. Hong '02: Ph.D. thesis on information extraction from web pages
      • Higuera '01: Current trends in grammar induction
      • Garofalakis et al. '00: XTRACT for inferring DTDs

15
0, 24
foo, beg
bar, end
0, 56
baz, middle
0, 12
0, 33