From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data - PowerPoint PPT Presentation

Slides: 27

Provided by: dav8190

Learn more at: https://www.cs.princeton.edu

Transcript and Presenter's Notes

1
From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org
2
Data, data, everywhere

AT&T and other information technology companies spend huge amounts of time and energy processing ad hoc data.
Ad hoc data: data in non-standard formats with no a priori data processing tools or libraries available.
Not free text; not HTML; not XML.
Common problems: no documentation, evolving formats, huge volume, error-filled ...

Router Configs
Network Monitoring
Web Logs
Billing Info
Call Details
3
Data, data, everywhere
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd_at_grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
Web server common log format
4
Data, data, everywhere
91522729152272128136400922813640092281364009
22813640092no_ii152272EDTF_60MARVINS1UNO10
1000295291 915227291522721281364009228136400
9228136400922813640092no_ii15222EDTF_60MARV
INS1UNO101000295291201000295291171001649600
191001 649600271001649600291001649600IA028
81001714400IE02881001714400EDTF_CRTE100190880
0EDTF_OS_11001995201161021309814261054589982

AT&T phone call provisioning data
5
Data, data, everywhere
HA00000000START OF TEST CYCLE aA00000001BXYZ
U1AB0000040000100B0000004200 HE00000005START OF
SUMMARY f 00000006NYZX B1QB00052000120000070000B00
0050000000520000 0049000000510000000100B0000000
5300000052500000535000 HF00000007END OF SUMMARY k
00000008LYXW B1KB0000065G0000009900100000001000020
000 HB00000009END OF TEST CYCLE
www.opradata.com
6
Data, data, everywhere
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
7
Goal
[Diagram labels: Raw Data (Billing Info, ASCII log files, Call Detail); Standard formats & schema (CSV, XML); End-user tools; Visual Information]
We want to create this arrow
8
Half-way there: The PADS System 1.0 [FG PLDI 05, FMW POPL 06, MFWFG POPL 07]
[Diagram labels: Ad Hoc Data Source; PADS Data Description; PADS Compiler; PADS Runtime System (I/O, Error Handling); Generated Libraries (Parsing, Printing, Traversal); XML Converter; Data Profiler; Graphing Tool; Query Engine; Custom App; XML; Analysis Report; Graph; Information; ?]
Generic description-directed programs, coded once
9
PADS Language Overview

Rich base type library
  Integers: Pint8, Puint32, ...
  Strings: Pstring(), Pstring_FW(3), ...
  Systems data: Pdate, Ptime, Pip, ...
Type constructors describe complex data sources
  Sequences: Pstruct, Parray, ...
  Choices: Punion, Penum, Pswitch
  Constraints: arbitrary predicates describe expected semantic properties
  Parameterization: allows definition of generic descriptions

Data formats are described using a specialized language of types.
A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.
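
To make the idea of a "language of types" concrete, here is a minimal Python sketch of how base types and type constructors might be modelled as a small tree of values. This is not PADS syntax or the PADS implementation: the class names (`Base`, `Lit`, `Struct`, `Choice`, `Array`), the `show` notation, and the use of `Pint32` for signed integers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Base:
    name: str  # a base type name such as "Pint8", "Pstring" or "Pip"


@dataclass(frozen=True)
class Lit:
    text: str  # literal text that must appear in the data, e.g. ","


@dataclass(frozen=True)
class Struct:
    fields: Tuple["Desc", ...]  # fixed sequence of sub-descriptions


@dataclass(frozen=True)
class Choice:
    branches: Tuple["Desc", ...]  # alternatives, in the spirit of Punion


@dataclass(frozen=True)
class Array:
    element: "Desc"
    separator: str  # text between consecutive elements


Desc = Union[Base, Lit, Struct, Choice, Array]


def show(d: Desc) -> str:
    """Render a description in an ad hoc, human-readable notation."""
    if isinstance(d, Base):
        return d.name
    if isinstance(d, Lit):
        return repr(d.text)
    if isinstance(d, Struct):
        return "Pstruct{" + "; ".join(show(f) for f in d.fields) + "}"
    if isinstance(d, Choice):
        return "Punion{" + " | ".join(show(b) for b in d.branches) + "}"
    if isinstance(d, Array):
        return f"Parray[{show(d.element)} sep {d.separator!r}]"
    raise TypeError(f"not a description: {d!r}")


# Hypothetical description of records such as "123, 24" or "345, begin".
EXAMPLE = Struct((
    Base("Pint32"),
    Lit(","),
    Lit(" "),
    Choice((Base("Pint32"), Base("Pstring"))),
))

if __name__ == "__main__":
    print(show(EXAMPLE))
```

Keeping descriptions as plain immutable values makes it easy for later stages (inference, refinement, printing) to build and rewrite them.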
10
The Last Mile: The PADS System 2.0
[Diagram labels: Raw Data; Format Inference Engine (Chunking & Tokenization, Structure Discovery, Format Refinement, Scoring Function); PADS Data Description; PADS Compiler; XMLifier (XML); Profiler (Analysis Report)]
11
Chunking Process

Convert raw input into a sequence of chunks.
Supported divisions:
  Various forms of newline
  File boundaries
Also possible: user-defined "paragraphs"
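
A minimal Python sketch of the chunking step, assuming chunks are plain strings. The function names are hypothetical, and treating a "paragraph" as text separated by blank lines is only one possible reading of the slide.

```python
import re
from typing import List


def chunk_by_newline(raw: str) -> List[str]:
    """Split on \\r\\n, \\r or \\n and drop empty chunks."""
    return [c for c in re.split(r"\r\n|\r|\n", raw) if c]


def chunk_by_paragraph(raw: str) -> List[str]:
    """Split on blank lines (an assumed definition of 'paragraph')."""
    normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
    parts = re.split(r"\n[ \t]*\n+", normalized)
    return [p.strip() for p in parts if p.strip()]


def chunk_by_file(files: List[str]) -> List[str]:
    """Treat the contents of each file as one chunk."""
    return [text for text in files if text]


if __name__ == "__main__":
    print(chunk_by_newline("123, 24\r\n345, begin\n574, end\r"))
    print(chunk_by_paragraph("a\nb\n\nc\n"))
```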

12
Tokenization

Tokens/base types are expressed as regular expressions.
Basic tokens: integer, white space, punctuation, strings
Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
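
A minimal sketch of a regular-expression tokenizer in Python. The token names and patterns below are illustrative assumptions, not the token set used by PADS; distinctive tokens are listed before basic ones so that they win when both could match.

```python
import re
from typing import List, Tuple

# (name, pattern) pairs; earlier entries take priority over later ones.
TOKEN_SPECS = [
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("DATE", r"\d{1,2}/[A-Za-z]{3}/\d{4}"),
    ("TIME", r"\d{2}:\d{2}:\d{2}"),
    ("INT", r"-?\d+"),
    ("WHITE", r"[ \t]+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_]*"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]

TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))


def tokenize(chunk: str) -> List[Tuple[str, str]]:
    """Return (token name, matched text) pairs for one chunk."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(chunk)]


if __name__ == "__main__":
    line = "207.136.97.49 - - [15/Oct/1997:18:46:51 -0700]"
    for name, text in tokenize(line):
        print(name, repr(text))
```

Python tries the alternatives of a regular expression left to right, so the order of `TOKEN_SPECS` acts as a priority list.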

13
Histograms
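
A minimal sketch of the per-token histograms, assuming each chunk has already been reduced to a list of token names. For every token, the histogram records how many chunks contain it exactly k times. The function name and the sample input are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def token_histograms(chunks: List[List[str]]) -> Dict[str, Counter]:
    """Map token name -> Counter{occurrences per chunk: number of chunks}."""
    names = {tok for chunk in chunks for tok in chunk}
    hists: Dict[str, Counter] = {name: Counter() for name in names}
    for chunk in chunks:
        counts = Counter(chunk)
        for name in names:
            # counts[name] is 0 when the token is absent from this chunk
            hists[name][counts[name]] += 1
    return hists


if __name__ == "__main__":
    sample = [
        ["INT", "COMMA", "WHITE", "INT"],
        ["INT", "COMMA", "WHITE", "STRING"],
        ["INT", "COMMA", "WHITE", "STRING"],
    ]
    for name, hist in sorted(token_histograms(sample).items()):
        print(name, dict(hist))
```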
14
Clustering
Group tokens into clusters with similar frequency distributions.
Cluster 1
Cluster 2
Cluster 3
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
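
A minimal sketch of the similarity test and cluster ranking described above. The slide gives no formulas, so the details are assumptions: shapes are compared column by column after sorting by height, the tolerance defaults to 0.1, and the score is coverage divided by the number of distinct non-zero counts.

```python
from typing import Dict, List

Hist = Dict[int, int]  # occurrences per chunk -> number of chunks


def shape(h: Hist) -> List[float]:
    """Column heights as fractions of all chunks, tallest first."""
    total = sum(h.values())
    return sorted((v / total for v in h.values()), reverse=True)


def similar(a: Hist, b: Hist, tol: float = 0.1) -> bool:
    sa, sb = shape(a), shape(b)
    n = max(len(sa), len(sb))
    sa = sa + [0.0] * (n - len(sa))
    sb = sb + [0.0] * (n - len(sb))
    return all(abs(x - y) <= tol for x, y in zip(sa, sb))


def cluster(hists: Dict[str, Hist], tol: float = 0.1) -> List[List[str]]:
    """Greedy grouping: join the first cluster whose first member is similar."""
    clusters: List[List[str]] = []
    for name in sorted(hists):
        for c in clusters:
            if similar(hists[c[0]], hists[name], tol):
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def score(names: List[str], hists: Dict[str, Hist]) -> float:
    """Assumed metric: high coverage and few distinct counts score well."""
    def coverage(h: Hist) -> float:
        total = sum(h.values())
        return (total - h.get(0, 0)) / total

    def width(h: Hist) -> int:
        return sum(1 for k, v in h.items() if k != 0 and v > 0)

    cov = min(coverage(hists[n]) for n in names)
    wid = max(width(hists[n]) for n in names)
    return cov / wid if wid else 0.0


HISTS = {
    "COMMA": {1: 6},
    "WHITE": {1: 6},
    "INT": {1: 4, 2: 2},
    "STRING": {0: 2, 1: 4},
}

if __name__ == "__main__":
    groups = cluster(HISTS)
    best = max(groups, key=lambda g: score(g, HISTS))
    print(groups, best)
```

With the hypothetical `HISTS` input, the comma and white-space tokens form the highest-scoring cluster because they occur exactly once in every chunk.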
15
Partition chunks
In our example, all the tokens appear in the same
order in all chunks, so the union is degenerate.
16
Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
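
A minimal sketch of finding subcontexts, assuming the selected cluster's tokens appear in the same order in every chunk. Each chunk is cut at those tokens, and the pieces in the same position across chunks form one subcontext to recurse on. The function names and sample records are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token name, matched text)


def split_on_cluster(tokens: List[Token], order: List[str]) -> List[List[Token]]:
    """Cut one chunk at the cluster tokens, which must occur in `order`."""
    contexts: List[List[Token]] = []
    current: List[Token] = []
    i = 0
    for tok in tokens:
        if i < len(order) and tok[0] == order[i]:
            contexts.append(current)
            current = []
            i += 1
        else:
            current.append(tok)
    contexts.append(current)
    if i != len(order):
        raise ValueError("chunk does not contain every cluster token in order")
    return contexts


def subcontexts(chunks: List[List[Token]], order: List[str]) -> List[List[List[Token]]]:
    """Column j holds the j-th piece of every chunk."""
    rows = [split_on_cluster(c, order) for c in chunks]
    return [list(col) for col in zip(*rows)]


if __name__ == "__main__":
    records = [
        [("INT", "123"), ("COMMA", ","), ("WHITE", " "), ("INT", "24")],
        [("INT", "345"), ("COMMA", ","), ("WHITE", " "), ("STRING", "begin")],
    ]
    for column in subcontexts(records, ["COMMA", "WHITE"]):
        print(column)
```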
17
Then Recurse...
18
Inferred type
19
Structure Discovery Review

Compute frequency distribution for each token.
Cluster tokens with similar frequency distributions.
Create hypothesis about data structure from cluster distributions:
  Struct
  Array
  Union
  Basic type (bottom out)
Partition data according to hypothesis and recurse.

123, 24
345, begin
574, end
9378, 56
12, middle
-12, problem
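
A deliberately simplified sketch of the hypothesis step, with chunks given as lists of token names. The decision rules are assumptions chosen for illustration, not the criteria PADS uses: the same cluster tokens with the same counts in every chunk suggest a struct, presence everywhere with varying counts suggests an array, and anything else falls back to a union.

```python
from collections import Counter
from typing import List, Set


def hypothesize(chunks: List[List[str]], cluster: Set[str]) -> str:
    """Return 'Base', 'Struct', 'Array' or 'Union' for this context."""
    if not chunks:
        raise ValueError("no chunks to analyse")
    # Bottom out: every chunk is the same single token.
    if all(len(c) == 1 for c in chunks) and len({c[0] for c in chunks}) == 1:
        return "Base"
    if not cluster:
        return "Union"
    per_chunk = [Counter(t for t in c if t in cluster) for c in chunks]
    present = all(all(pc[t] > 0 for t in cluster) for pc in per_chunk)
    same_counts = all(pc == per_chunk[0] for pc in per_chunk)
    if present and same_counts:
        return "Struct"
    if present:
        return "Array"
    return "Union"


if __name__ == "__main__":
    records = [
        ["INT", "COMMA", "WHITE", "INT"],
        ["INT", "COMMA", "WHITE", "STRING"],
        ["INT", "COMMA", "WHITE", "STRING"],
    ]
    print(hypothesize(records, {"COMMA", "WHITE"}))        # whole record
    print(hypothesize([["INT"], ["INT"], ["INT"]], {"INT"}))  # first field
    print(hypothesize([["INT"], ["STRING"], ["STRING"]], {"INT"}))  # last field
```

Applied to records shaped like the sample data, this sketch yields a struct whose first field is a base type and whose last field is a union of integer and string.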
20
Testing and Evaluation

Evaluated overall results qualitatively
  Compared with Excel: a manual process with limited facilities for representation of hierarchy or variation
  Compared with hand-written descriptions: performance variable depending on tokenization choices & complexity
Evaluated accuracy quantitatively
  For many formats, 95% accuracy from 5% of available data
Evaluated performance quantitatively
  Hours to days to hand-write formats
  After fixing the format, appears to scale linearly with data size
  < 1 min on 300K data

21
Technical Summary www.padsproj.org

PADS 1.0 is an effective implementation framework for many data processing tasks.
PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries.

[Diagram labels: Email, ASCII log files, Binary Traces; struct ...; CSV, XML]
22
End
23
Execution Time
Data source                 SD (s)  Ref (s)  Tot (s)   HW (h)
1967Transactions.short        0.20     2.32     2.56      4.0
MER_T01_01.csv                0.11     2.82     2.92      0.5
Ai.3000                       1.97    26.35    28.64      1.0
Asl.log                       2.90    52.07    55.26      1.0
Boot.log                      0.11     2.40     2.53      1.0
Crashreporter.log             0.12     3.58     3.73      2.0
Crashreporter.log.mod         0.15     3.83     4.00      2.0
Sirius.1000                   2.24     5.69     8.00      1.5
Ls-l.txt                      0.01     0.10     0.11      1.0
Netstat-an                    0.07     0.74     0.82      1.0
Page_log                      0.08     0.55     0.65      0.5
quarterlypersonalincome       0.07     5.11     5.18       48
Railroad.txt                  0.06     2.69     2.76      2.0
Scrollkeeper.log              0.13     3.24     3.40      1.0
Windowserver_last.log         0.37     9.65    10.07      1.5
Yum.txt                       0.11     1.91     2.03      5.0
SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
24
Training Time
25
Minimum Necessary Training Sizes
Training size (% of available data) needed to reach 90% and 95% accuracy
Data source                  90%   95%
Sirius.1000                    5    10
1967Transactions.short         5     5
Ai.3000                        5    10
Asl.log                        5    10
Scrollkeeper.log               5     5
Page_log                       5     5
MER_T01_01.csv                 5     5
Crashreporter.log             10    15
Crashreporter.log.mod          5    15
Windowserver_last.log          5    15
Netstat-an                    25    35
Yum.txt                       30    45
quarterlypersonalincome       10    10
Boot.log                      45    60
Ls-l.txt                      50    65
Railroad.txt                  60    75
26
Problem: Tokenization

Technical problem
  Different data sources assume different tokenization strategies.
  Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions.
  Matching the tokenization of the underlying data source can make a big difference in structure discovery.
Current solution
  Parameterize the learning system with customizable configuration files.
  Automatically generate lexer file & basic token types.
Future solutions
  Use existing PADS descriptions and data sources to learn probabilistic tokenizers.
  Incorporate probabilities into a sophisticated back-end rewriting system.
  The back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead.
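
A small Python sketch of why overlapping token definitions matter and how a configurable token list changes the result. The two configurations (`NETWORK`, `NUMERIC`) and the helper `make_tokenizer` are hypothetical; they stand in for the customizable configuration files mentioned above and are not the PADS lexer.

```python
import re
from typing import Callable, List, Tuple

Token = Tuple[str, str]


def make_tokenizer(specs: List[Tuple[str, str]]) -> Callable[[str], List[Token]]:
    """Build a tokenizer from (name, pattern) pairs; earlier pairs win."""
    pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in specs))

    def tokenize(text: str) -> List[Token]:
        return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

    return tokenize


# Configuration for network data: IP addresses are a distinctive token.
NETWORK = [
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("FLOAT", r"\d+\.\d+"),
    ("INT", r"\d+"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]

# Configuration for numeric data: no IP token, so floats take over.
NUMERIC = [
    ("FLOAT", r"\d+\.\d+"),
    ("INT", r"\d+"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]

if __name__ == "__main__":
    text = "10.0.0.1"
    print(make_tokenizer(NETWORK)(text))
    print(make_tokenizer(NUMERIC)(text))
```

The same four-part string is read as one IP address under one configuration and as two floats around a dot under the other, which would lead structure discovery to very different hypotheses.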