From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data
1
From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data
  • Kenny Zhu
  • Princeton University

with Kathleen Fisher, David Walker and Peter White
2
A System Admin's Life
3
Web Server Logs
4
System Logs
5
Application Configs
6
User Emails
7
Script Outputs and more
  • Not well documented
  • Poorly understood
  • No ready-made tools / libraries
  • Change from time to time!
  • Ad-hoc Data!

8
Automatically Generate Tools from Data!
  • XML converter
  • Data profiler
  • Grapher, etc.

CAGI 2007, POPL 2008, SIGMOD 2008
9
Architecture
(Architecture diagram: Raw Data flows into the LearnPADS format inference engine, which performs Tokenization, then Structure Discovery, then Format Refinement, guided by a Scoring Function. The resulting Data Description is fed to the PADS Compiler, which generates tools such as the XML converter and the Profiler.)
10
Simple PADS Description
  • Data Sources

"0, 24"
"bar, end"
"foo, 16"

  • Description

Punion payload {
    Pint32 i;
    PstringFW(3) s2;
};
Pstruct source {
    '\"'; payload p1;
    ", "; payload p2;
    '\"';
};
  • Key points to know
  • Descriptions are based on programming-language types
  • Broad collection of base types (ints, strings, dates, IP addresses, ...)
  • Structured types include structs, unions and arrays
  • PADS has many other features: dependency, constraints, recursion, ...
  • But descriptions can be complicated and error-prone for humans to write

11
Demo
12
Tokenization
  • Parse strings and convert them to symbolic tokens
  • Basic token set is skewed towards systems data:
  • int, string, date, time, URLs, hostnames
  • A config file allows users to define their own new token types via regular expressions:
  • Ptypedef Pstring_ME("/[ \\t\\r\\n]+/") PPwhite;
  • Ptypedef Pstring_ME("/[0-9]+/") PPint;
  • Ptypedef Pstring_ME("/[A-Za-z]([A-Za-z0-9_\\-])*/") PPstring;

tokenize:
"0, 24"     →  INT , INT
"bar, end"  →  STR , STR
"foo, 16"   →  STR , INT
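
A minimal sketch of this kind of regex-driven tokenizer in Python (the token names and patterns are illustrative assumptions, not the exact LearnPADS base-token set):

import re

# Ordered token definitions: earlier patterns take priority.
TOKEN_DEFS = [
    ("WHITE", r"[ \t\r\n]+"),
    ("INT",   r"[0-9]+"),
    ("STR",   r"[A-Za-z][A-Za-z0-9_\-]*"),
    ("PUNCT", r"."),          # fallback: any single character
]
MASTER = re.compile("|".join("(?P<%s>%s)" % d for d in TOKEN_DEFS))

def tokenize(line):
    """Map a raw input line to a list of (token name, lexeme) pairs."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(line)]

print([t for t, _ in tokenize('"0, 24"') if t != "WHITE"])
# ['PUNCT', 'INT', 'PUNCT', 'INT', 'PUNCT']  (quote, INT, comma, INT, quote)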
13
Structure Discovery Overview
  • Top-down, divide-and-conquer algorithm
  • Compute various statistics from tokenized data
  • Guess a top-level description
  • Partition tokenized data into smaller chunks
  • Recursively analyze and compute descriptions from
    smaller chunks

14
Structure Discovery Overview
(Bullets repeated from the previous slide; animation step. Figure: the candidate structure so far is a struct with three children: an unknown field (?), a comma, and another unknown field (?). The sources "INT , INT", "STR , STR" and "STR , INT" are split at the comma into left and right sub-chunks for recursive discovery.)
15
Structure Discovery Overview
(Bullets repeated from the previous slide; animation step. Figure: after one recursive step, the left sub-chunk resolves to a union of INT and STR; recursion continues on the remaining sub-chunks until the final description struct(union(INT, STR), ",", union(INT, STR)) emerges.)
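
In outline, the algorithm divides on the strongest shared token and recurses. A toy, runnable Python sketch (the real classifier uses the histogram statistics on the following slides, not this simplistic rule):

def classify(chunks):
    """Toy classifier: if some token occurs exactly once in every chunk,
    treat it as a struct separator; if all chunks are single tokens,
    it is a base type or a union; otherwise give up and call it a union."""
    if all(len(c) == 1 for c in chunks):
        kinds = {c[0] for c in chunks}
        return ("base", kinds.pop()) if len(kinds) == 1 else ("union", kinds)
    for tok in set(chunks[0]):
        if all(c.count(tok) == 1 for c in chunks):
            return ("struct", tok)
    return ("union", None)

def partition(chunks, sep):
    """Split every chunk at the separator token into left/right sub-chunks."""
    lefts  = [c[:c.index(sep)] for c in chunks]
    rights = [c[c.index(sep) + 1:] for c in chunks]
    return [lefts, rights]

def discover(chunks):
    """Top-down divide-and-conquer structure discovery (toy version)."""
    kind, evidence = classify(chunks)
    if kind in ("base", "union"):
        return (kind, evidence)
    return (kind, evidence, [discover(s) for s in partition(chunks, evidence)])

data = [["INT", ",", "INT"], ["STR", ",", "STR"], ["STR", ",", "INT"]]
print(discover(data))
# ('struct', ',', [('union', {'INT', 'STR'}), ('union', {'INT', 'STR'})])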
16
Structure Discovery Details
  • Compute frequency distribution histogram for each
    token.
  • (And recompute at every level of recursion).

INT , INT
STR , STR
STR , INT

(Figure: one frequency distribution histogram per token; x-axis: number of occurrences per source, y-axis: percentage of sources.)
17
Structure Discovery Details
  • Cluster tokens with similar histograms into groups
  • Similar histograms:
  • tokens with strong regularity coexist in the same description component
  • use symmetric relative entropy to measure similarity
  • Only the shape of the histogram matters:
  • normalize histograms by sorting columns in descending size
  • result: comma and quote in one group; int and string in another

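Symmetric relative entropy here means the sum of the two KL divergences between the normalized histograms; a minimal Python sketch (the smoothing constant is an assumption):

import math

def normalize(hist, eps=1e-9):
    """Sort columns in descending order (only the shape matters) and
    convert to a smoothed probability distribution."""
    cols = sorted(hist, reverse=True)
    total = float(sum(cols)) or 1.0
    return [(c + eps) / (total + eps * len(cols)) for c in cols]

def sym_rel_entropy(h1, h2):
    """Symmetric relative entropy: KL(p||q) + KL(q||p) over padded,
    normalized histograms."""
    p, q = normalize(h1), normalize(h2)
    n = max(len(p), len(q))
    p += [1e-9] * (n - len(p))
    q += [1e-9] * (n - len(q))
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)

print(sym_rel_entropy([3, 0, 0], [5, 0, 0]))   # near 0: same shape
print(sym_rel_entropy([3, 0, 0], [1, 1, 1]))   # clearly positive: different shape
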
18
Structure Discovery Details
  • Classify the groups into:
  • Structs: groups with high coverage and low residual mass
  • Arrays: groups with high coverage, sufficient width and high residual mass
  • Unions: all other token groups
  • Pick the group with the strongest signal to divide and conquer
  • More mathematical details are in the POPL '08 paper
  • The struct involving comma and quote is identified in the histogram above
  • The overall procedure gives a good starting point for refinement

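As a rough sketch, the classification rule can be written as follows (the coverage/residual-mass statistics follow the POPL '08 paper, but the threshold values here are made-up assumptions):

def classify_group(coverage, residual_mass, width,
                   hi_cov=0.9, lo_res=0.1, min_width=3):
    """Classify a token group from its histogram statistics.

    coverage: fraction of source lines containing the group's tokens
    residual_mass: histogram mass outside the dominant column
    width: number of distinct occurrence counts observed
    Threshold values are illustrative assumptions, not the paper's constants."""
    if coverage >= hi_cov and residual_mass <= lo_res:
        return "struct"
    if coverage >= hi_cov and width >= min_width and residual_mass > lo_res:
        return "array"
    return "union"
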
19
Format Refinement
  • Reanalyze the source data with the aid of the rough description and obtain functional dependencies and constraints
  • Rewrite the format description to:
  • simplify presentation
  • merge and rewrite structures
  • improve precision
  • add constraints (uniqueness, ranges, functional dependencies)
  • fill in missing details
  • find completions where structure discovery bottoms out
  • refine base types (integer sizes, array sizes, separators and terminators)

20
Format Refinement
  • Three main sub-phases
  • Phase 1: tagging / table generation
  • Convert the rough description into a tagged description and a relational table
  • Phase 2: constraint inference
  • Analyze the table and infer constraints
  • Find functional dependencies
  • Phase 3: format rewriting
  • Use inferred constraints and type isomorphisms to rewrite the rough description
  • Greedy search to optimize an information-theoretic score

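Phase 2 can be approximated by a brute-force functional dependency check over the generated table; a minimal sketch (the column names are hypothetical):

def holds_fd(rows, x, y):
    """Check whether column x functionally determines column y in a
    table given as a list of dicts (one dict per parsed record)."""
    seen = {}
    for row in rows:
        if x in row and y in row:
            if seen.setdefault(row[x], row[y]) != row[y]:
                return False   # same x value maps to two different y values
    return True

table = [{"id": 0, "code": 24}, {"id": 0, "code": 24}, {"id": 1, "code": 56}]
print(holds_fd(table, "id", "code"))   # True: id determines code
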
21
Information Theoretic Scoring Function
  • Finding a function to evaluate the goodness of a description involves balancing two ideas:
  • a description must be concise
  • people cannot read and understand enormous descriptions
  • a description must be precise
  • imprecise descriptions do not give us much useful information
  • Note the trade-off:
  • increasing precision (good) usually increases description size (bad)
  • a Punion of all records?
  • decreasing description size (good) usually decreases precision (bad)
  • a single Pstring?
  • Minimum Description Length (MDL) principle
  • Normalized information-theoretic scores
TransmissionBits(T, D) = BitsForDescription(T) + BitsForData(D | T)
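
In MDL terms, candidate descriptions compete on total transmission cost; a toy Python illustration (both cost functions are simplistic assumptions standing in for the paper's actual scoring function):

import math

def bits_for_description(desc):
    """Toy type cost: 8 bits per character of the printed description."""
    return 8 * len(desc)

def bits_for_data(data, choices_per_char):
    """Toy data cost: assume the description narrows each character
    to `choices_per_char` alternatives (an assumption)."""
    return sum(len(line) * math.log2(choices_per_char) for line in data)

def transmission_bits(desc, data, choices_per_char):
    # TransmissionBits = BitsForDescription(T) + BitsForData(D | T)
    return bits_for_description(desc) + bits_for_data(data, choices_per_char)

data = ['"0, 24"', '"bar, end"', '"foo, 16"'] * 100
# With enough data the precise description wins: it costs more bits itself
# but makes each data character much cheaper to encode than a vague Pstring.
print(transmission_bits('struct(union(INT,STR), ", ", union(INT,STR))', data, 16))
print(transmission_bits("Pstring", data, 128))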
22
Refinement Simple Example
23
"0, 24"
"foo, beg"
"bar, end"
"0, 56"
"baz, middle"
"0, 12"
"0, 33"
24
Evaluation
25
Benchmark Formats
Available at http://www.padsproj.org/
26
Execution Times
27
Training Time vs. Training Size
28
Training Accuracy vs. Training Size
29
Related Work
  • Most common domains for grammar induction:
  • XML/HTML inference
  • natural language processing
  • Systems that focus on ad hoc data are rare, and those that exist don't support a PADS-style tool suite
  • Rufus system '93, TSIMMIS '94, Potter's Wheel '01
  • Top-down structure discovery
  • Arasu & Garcia-Molina '03 (extracting data from web pages)
  • Restricted languages
  • k-reversible languages (Angluin '82)
  • SOREs and CHAREs (Bex et al. '06)
  • Approximate methods using MDL and grammar-rewriting search
  • Stolcke and Omohundro '94, inducing probabilistic grammars
  • T. W. Hong '02, Ph.D. thesis on information extraction from web pages
  • Garofalakis et al. '00, XTRACT for inferring DTDs
  • Our contributions:
  • An end-to-end problem: automatic generation of tools
  • An under-studied domain: ad hoc data
  • A new top-down structure discovery algorithm

30
Current and Ongoing Work
31
Learning Tokenization Questions
  • Degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.
  • Example 1:
  • Is "abc.tv/mypath/my script?a1b2"
  • a filename, a filepath, or a URL?
  • or word dot word slash word slash word space word?
  • Example 2:
  • Is 85.0 a float, or an int dot int?
  • Is 85 an int or a float?
  • Good tokens capture high-level, human abstractions compactly.
  • Current token definitions can be complex and ambiguous.
  • Fixed regex-based tokenization is not good enough!
  • Given a string of characters, find the best sequence of base tokens
  • Learn a probabilistic tokenizer?

32
Learning Tokenization Initial Ideas
  • Token sequence is a first-order Hidden Markov Model (HMM)
  • Hidden states are tokens (one per character)
  • Observables are single characters from the input
  • Each input character is mapped to a feature vector with features such as:
  • is it a digit?
  • is it uppercase?
  • is it lowercase?
  • is it punctuation?
  • is it whitespace? etc.

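A minimal sketch of such a per-character feature vector in Python (the exact feature set is an assumption):

import string

def char_features(c):
    """Map one input character to a binary feature vector.
    The feature set is illustrative; the real system may use others."""
    return (
        c.isdigit(),                  # is it a digit?
        c.isupper(),                  # is it uppercase?
        c.islower(),                  # is it lowercase?
        c in string.punctuation,      # is it punctuation?
        c.isspace(),                  # is it whitespace?
    )

# Each distinct feature vector is one observable symbol for the HMM.
print(char_features("7"), char_features("a"), char_features(","))
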
33
Learning Tokenization Main Algorithm
  • Manually write PADS descriptions for a large corpus of training data using a predefined set of base types
  • Parse the training data into base tokens (labeling)
  • Learn the HMM parameters: initial probabilities, transition probabilities and emission probabilities
  • Run the HMM on test data to get the most probable single-character state sequence using the Viterbi algorithm (sketched below)
  • Merge consecutive identical tokens to get the optimal token sequence

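A compact Viterbi decoder plus the merging step, as a Python sketch (the log-probability tables init, trans and emit are assumed to have been learned as above):

def viterbi(obs, states, init, trans, emit):
    """Most probable token-per-character sequence for one input line.
    init[s], trans[s][t] and emit[s][o] are log-probabilities assumed
    to have been learned from the hand-labeled training corpus."""
    best = {s: (init[s] + emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {
            t: max(((p + trans[s][t] + emit[t][o], path + [t])
                    for s, (p, path) in best.items()),
                   key=lambda x: x[0])
            for t in states
        }
    return max(best.values(), key=lambda x: x[0])[1]

def merge_runs(state_seq):
    """Collapse runs of identical per-character tokens into single tokens."""
    merged = []
    for s in state_seq:
        if not merged or merged[-1] != s:
            merged.append(s)
    return merged

print(merge_runs(["INT", "INT", "PUNCT", "STR", "STR"]))
# ['INT', 'PUNCT', 'STR']
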
34
Conclusions
  • We are able to produce XML and statistical reports fully automatically from ad hoc data sources.
  • We've tested on approximately 16 real, mostly systems-oriented data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is good success.
  • We are currently studying the problem of probabilistic tokenization.
  • For papers, online demos and PADS software, see our website at

http://www.padsproj.org/
35
LearnPADS On the Web
36
End
37
Learning Tokenization Challenges
  • Training data
  • Where and how do we get more training data?
  • Is there a way to simplify the laborious manual labeling process?
  • How do we adjust the influence weights of different data sources?
  • Categorize the training data sources?
  • Features
  • What is the best set of features to use?
  • Fewer features: cluster the data better and are easy to compute
  • More features: better distinguish some ambiguous tokens, but require larger training data sets and may result in no significant distribution
  • Another level of HMM to learn which features to use?
  • The model
  • Is a single-character HMM good enough? Only up to 256 features
  • Multi-character states give rise to more features
  • Maybe a higher-order HMM, HHMM, gHMM or CRF?