Title: From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
1. From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org
2. Data, data, everywhere
- AT&T and other information technology companies spend huge amounts of time and energy processing ad hoc data
- Ad hoc data: data in non-standard formats with no a priori data processing tools/libraries available
  - not free text; not HTML; not XML
- Common problems: no documentation, evolving formats, huge volume, error-filled ...
Router Configs
Network Monitoring
Web Logs
Billing Info
Call Details
3. Data, data, everywhere
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd_at_grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
web server common log format
4. Data, data, everywhere
91522729152272128136400922813640092281364009
22813640092no_ii152272EDTF_60MARVINS1UNO10
1000295291 915227291522721281364009228136400
9228136400922813640092no_ii15222EDTF_60MARV
INS1UNO101000295291201000295291171001649600
191001 649600271001649600291001649600IA028
81001714400IE02881001714400EDTF_CRTE100190880
0EDTF_OS_11001995201161021309814261054589982
AT&T phone call provisioning data
5. Data, data, everywhere
HA00000000START OF TEST CYCLE aA00000001BXYZ
U1AB0000040000100B0000004200 HE00000005START OF
SUMMARY f 00000006NYZX B1QB00052000120000070000B00
0050000000520000 0049000000510000000100B0000000
5300000052500000535000 HF00000007END OF SUMMARY k
00000008LYXW B1KB0000065G0000009900100000001000020
000 HB00000009END OF TEST CYCLE
www.opradata.com
6. Data, data, everywhere
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
7. Goal
Visual Information
End-user tools
Billing Info
ASCII log files
Call Detail
Raw Data
CSV
XML
Standard formats & schema
We want to create this arrow
8. Half-way there: The PADS System 1.0 [FG PLDI 05, FMW POPL 06, MFWFG POPL 07]
Ad Hoc Data Source
PADS Data Description
PADS Runtime System (I/O, Error Handling)
PADS Compiler
Generated Libraries (Parsing, Printing, Traversal)
XML Converter
Data Profiler
Graphing Tool
Query Engine
Custom App
generic description-directed programs, coded once
?
XML
Analysis Report
Graph
Information
9. PADS Language Overview
- Rich base type library
  - integers: Pint8, Puint32, ...
  - strings: Pstring(), Pstring_FW(3), ...
  - systems data: Pdate, Ptime, Pip, ...
- Type constructors describe complex data sources
  - sequences: Pstruct, Parray, ...
  - choices: Punion, Penum, Pswitch
  - constraints: arbitrary predicates describe expected semantic properties
  - parameterization: allows definition of generic descriptions
Data formats are described using a specialized language of types.
A formal semantics gives meaning to descriptions in terms of both external format and internal data structures generated.
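The idea of describing a format with base types, type constructors, and constraints can be illustrated with a small Python sketch. This is not PADS syntax or the PADS implementation: the classes `Base`, `Lit`, `Struct`, and `Union` are hypothetical stand-ins for base types, literals, Pstruct-style sequences, and Punion-style choices, chosen only to show how one description drives parsing.

```python
import re

class Base:
    """A base type: a named regular expression plus an optional constraint."""
    def __init__(self, name, pattern, convert=str, check=lambda v: True):
        self.name, self.rx = name, re.compile(pattern)
        self.convert, self.check = convert, check

    def parse(self, s, pos):
        m = self.rx.match(s, pos)
        if not m:
            return None
        value = self.convert(m.group())
        return (value, m.end()) if self.check(value) else None

class Lit(Base):
    """A literal separator such as ',' or ' '."""
    def __init__(self, text):
        super().__init__(repr(text), re.escape(text))

class Struct:
    """Sequence: every field must parse, in order; literals are dropped."""
    def __init__(self, *fields):
        self.fields = fields

    def parse(self, s, pos):
        out = []
        for f in self.fields:
            r = f.parse(s, pos)
            if r is None:
                return None
            value, pos = r
            if not isinstance(f, Lit):
                out.append(value)
        return out, pos

class Union:
    """Choice: the first branch that parses wins."""
    def __init__(self, *branches):
        self.branches = branches

    def parse(self, s, pos):
        for b in self.branches:
            r = b.parse(s, pos)
            if r is not None:
                return r
        return None

# A description: an 8-bit integer, ", ", then either an integer or a word.
uint8 = Base("uint8", r"\d+", int, check=lambda v: v < 256)
word = Base("word", r"[A-Za-z]+")
entry = Struct(uint8, Lit(","), Lit(" "), Union(uint8, word))
```

Calling `entry.parse("123, begin", 0)` yields the parsed fields and the end position, while an input such as `"999, x"` is rejected because the constraint on `uint8` fails.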
10. The Last Mile: The PADS System 2.0
Raw Data
XML
XMLifier
Profiler
Analysis Report
Format Inference Engine
Chunking & Tokenization
Structure Discovery
PADS Data Description
Format Refinement
Scoring Function
PADS Compiler
11. Chunking Process
- Convert raw input into sequence of chunks.
- Supported divisions
  - Various forms of newline
  - File boundaries
- Also possible: user-defined paragraphs
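A minimal sketch of the chunking step in Python follows. The function names are hypothetical, and the paragraph rule (blocks separated by blank lines) is just one assumed example of a user-defined division; treating each file as a chunk would simply mean reading whole files into the list.

```python
import re

def chunk_by_newline(text):
    """Split on \\r\\n, \\n, or \\r and drop empty chunks."""
    return [c for c in re.split(r"\r\n|\n|\r", text) if c]

def chunk_by_paragraph(text):
    """Assumed user-defined 'paragraph': a block separated by blank lines."""
    parts = re.split(r"(?:\r\n|\n|\r){2,}", text)
    return [p.strip() for p in parts if p.strip()]
```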
12. Tokenization
- Tokens/Base types expressed as regular expressions.
- Basic tokens
  - Integer, white space, punctuation, strings
- Distinctive tokens
  - IP addresses, dates, times, MAC addresses, ...
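As a rough illustration of tokens expressed as regular expressions, the Python sketch below builds a tokenizer from an ordered token table. The table `TOKEN_SPEC` and its patterns are assumptions made for this example, not the token definitions used by the PADS system.

```python
import re

# Hypothetical token table. Distinctive tokens come before basic ones so that
# "207.136.97.49" is read as one IP token rather than several integers.
TOKEN_SPEC = [
    ("IP",    r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("TIME",  r"\d{2}:\d{2}:\d{2}"),
    ("INT",   r"-?\d+"),
    ("WORD",  r"[A-Za-z]+"),
    ("WHITE", r"[ \t]+"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(chunk):
    """Return (token_name, lexeme) pairs for one chunk."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(chunk)]
```

With this table, `tokenize("12, middle")` produces an INT, a PUNCT, a WHITE, and a WORD token, in that order.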
13. Histograms
14. Clustering
Group clusters with similar frequency distributions.
Cluster 1
Cluster 2
Cluster 3
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
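The histogram and similarity steps can be sketched in Python as follows. Each chunk is assumed to be a list of token names. The similarity test here (sum of absolute differences between sorted, normalized column heights, compared against a tolerance `tol`) is a simple stand-in chosen for illustration; the slides do not specify the exact measure or tolerance, and the scoring metric is not modeled.

```python
from collections import Counter

def histogram(chunks, token):
    """Map occurrence count k -> number of chunks containing `token` k times."""
    return Counter(chunk.count(token) for chunk in chunks)

def shape(hist, total):
    """Column heights as fractions of all chunks, sorted tallest first."""
    return sorted((n / total for n in hist.values()), reverse=True)

def similar(h1, h2, total, tol=0.1):
    """Assumed test: sorted shapes differ by at most `tol` in total."""
    s1, s2 = shape(h1, total), shape(h2, total)
    width = max(len(s1), len(s2))
    s1 += [0.0] * (width - len(s1))  # pad the shorter shape with empty columns
    s2 += [0.0] * (width - len(s2))
    return sum(abs(a - b) for a, b in zip(s1, s2)) <= tol
```

Because columns are sorted by height before comparison, a token that appears exactly once in every chunk and another that appears exactly twice in every chunk have the same shape and would land in the same cluster.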
15. Partition chunks
In our example, all the tokens appear in the same
order in all chunks, so the union is degenerate.
16. Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
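One way to picture partitioning and subcontext extraction is the Python sketch below. It assumes each chunk is a list of token names and that `cluster` is the set of tokens in the selected cluster; the function names and data layout are hypothetical. Chunks are grouped by the order in which the cluster tokens appear (one group per union branch), and the token runs between cluster tokens become the subcontexts on which discovery recurses.

```python
from collections import defaultdict

def split_on_cluster(chunk, cluster):
    """Split a token list at each cluster token.

    Returns (order, contexts): the cluster tokens in the order seen, and the
    token runs before, between, and after them.
    """
    order, contexts, current = [], [], []
    for tok in chunk:
        if tok in cluster:
            order.append(tok)
            contexts.append(current)
            current = []
        else:
            current.append(tok)
    contexts.append(current)
    return tuple(order), contexts

def partition(chunks, cluster):
    """Group chunks by the order of their cluster tokens."""
    groups = defaultdict(list)
    for chunk in chunks:
        order, contexts = split_on_cluster(chunk, cluster)
        groups[order].append(contexts)
    return dict(groups)
```

When `partition` returns a single group, every chunk has the cluster tokens in the same order, which corresponds to the degenerate union case; `zip(*group)` then lines up the subcontexts column by column.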
17. Then Recurse...
18. Inferred type
19. Structure Discovery Review
- Compute frequency distribution for each token.
- Cluster tokens with similar frequency distributions.
- Create hypothesis about data structure from cluster distributions
  - Struct
  - Array
  - Union
  - Basic type (bottom out)
- Partition data according to hypothesis & recurse
Example data:
123, 24
345, begin
574, end
9378, 56
12, middle
-12, problem
20. Testing and Evaluation
- Evaluated overall results qualitatively
  - Compared with Excel -- a manual process with limited facilities for representation of hierarchy or variation
  - Compared with hand-written descriptions -- performance variable depending on tokenization choices & complexity
- Evaluated accuracy quantitatively
  - For many formats, 95% accuracy from 5% of available data
- Evaluated performance quantitatively
  - Hours to days to hand-write formats
  - After fixing the format, appears to scale linearly with data size
  - < 1 min on 300K data
21. Technical Summary (www.padsproj.org)
- PADS 1.0 is an effective implementation framework for many data processing tasks
- PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries
Email
struct ........ ...... ...........
ASCII log files
Binary Traces
CSV
XML
22. End
23. Execution Time
Data source               SD (s)  Ref (s)  Tot (s)  HW (h)
1967Transactions.short      0.20     2.32     2.56     4.0
MER_T01_01.csv              0.11     2.82     2.92     0.5
Ai.3000                     1.97    26.35    28.64     1.0
Asl.log                     2.90    52.07    55.26     1.0
Boot.log                    0.11     2.40     2.53     1.0
Crashreporter.log           0.12     3.58     3.73     2.0
Crashreporter.log.mod       0.15     3.83     4.00     2.0
Sirius.1000                 2.24     5.69     8.00     1.5
Ls-l.txt                    0.01     0.10     0.11     1.0
Netstat-an                  0.07     0.74     0.82     1.0
Page_log                    0.08     0.55     0.65     0.5
quarterlypersonalincome     0.07     5.11     5.18      48
Railroad.txt                0.06     2.69     2.76     2.0
Scrollkeeper.log            0.13     3.24     3.40     1.0
Windowserver_last.log       0.37     9.65    10.07     1.5
Yum.txt                     0.11     1.91     2.03     5.0
SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
24. Training Time
25. Minimum Necessary Training Sizes
Training size (% of available data) needed to reach 90% and 95% accuracy
Data source                90%    95%
Sirius.1000                  5     10
1967Transactions.short       5      5
Ai.3000                      5     10
Asl.log                      5     10
Scrollkeeper.log             5      5
Page_log                     5      5
MER_T01_01.csv               5      5
Crashreporter.log           10     15
Crashreporter.log.mod        5     15
Windowserver_last.log        5     15
Netstat-an                  25     35
Yum.txt                     30     45
quarterlypersonalincome     10     10
Boot.log                    45     60
Ls-l.txt                    50     65
Railroad.txt                60     75
26. Problem: Tokenization
- Technical problem
  - Different data sources assume different tokenization strategies
  - Useful token definitions sometimes overlap, can be ambiguous, aren't always easily expressed using regular expressions
  - Matching tokenization of underlying data source can make a big difference in structure discovery.
- Current solution
  - Parameterize learning system with customizable configuration files
  - Automatically generate lexer file & basic token types
- Future solutions
  - Use existing PADS descriptions and data sources to learn probabilistic tokenizers
  - Incorporate probabilities into sophisticated back-end rewriting system
    - Back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without look ahead
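The ambiguity of overlapping token definitions can be made concrete with a short Python sketch. It assumes a tokenizer built from an ordered, configurable list of (name, regex) pairs in which earlier entries win; `make_tokenizer` and the token patterns are hypothetical and only illustrate why the configuration matters.

```python
import re

def make_tokenizer(spec):
    """Build a tokenizer from an ordered list of (name, regex) pairs.

    Earlier entries win when definitions overlap, so the order is itself
    part of the configuration.
    """
    master = re.compile("|".join(f"(?P<{n}>{p})" for n, p in spec))
    return lambda s: [(m.lastgroup, m.group()) for m in master.finditer(s)]

FLOAT = ("FLOAT", r"\d+\.\d+")
IP = ("IP", r"\d{1,3}(?:\.\d{1,3}){3}")
INT = ("INT", r"\d+")
DOT = ("DOT", r"\.")

ip_first = make_tokenizer([IP, FLOAT, INT, DOT])
float_first = make_tokenizer([FLOAT, IP, INT, DOT])
```

On the input `"10.0.0.1"`, `ip_first` returns a single IP token, while `float_first` returns a FLOAT, a DOT, and another FLOAT. Structure discovery would see two very different token sequences for the same data, which is why matching the tokenization to the data source matters.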