From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data



1
From Dirt to Shovels: Fully Automatic Tool
Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez,
Kathleen Fisher, Andrew Forrest, Bob Gruber,
Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org
2
Data, data, everywhere
  • AT&T and other information technology companies
    spend huge amounts of time and energy processing
    ad hoc data
  • Ad hoc data: data in non-standard formats with
    no a priori data processing tools/libraries
    available
  • Not free text, not HTML, not XML
  • Common problems: no documentation, evolving
    formats, huge volume, error-filled ...

Router Configs
Network Monitoring
Web Logs
Billing Info
Call Details
3
Data, data, everywhere
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd_at_grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
Web server common log format
4
Data, data, everywhere
91522729152272128136400922813640092281364009
22813640092no_ii152272EDTF_60MARVINS1UNO10
1000295291 915227291522721281364009228136400
9228136400922813640092no_ii15222EDTF_60MARV
INS1UNO101000295291201000295291171001649600
191001 649600271001649600291001649600IA028
81001714400IE02881001714400EDTF_CRTE100190880
0EDTF_OS_11001995201161021309814261054589982

AT&T phone call provisioning data
5
Data, data, everywhere
HA00000000START OF TEST CYCLE aA00000001BXYZ
U1AB0000040000100B0000004200 HE00000005START OF
SUMMARY f 00000006NYZX B1QB00052000120000070000B00
0050000000520000 0049000000510000000100B0000000
5300000052500000535000 HF00000007END OF SUMMARY k
00000008LYXW B1KB0000065G0000009900100000001000020
000 HB00000009END OF TEST CYCLE
www.opradata.com
6
Data, data, everywhere
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
7
Goal
Visual Information
End-user tools
Billing Info
ASCII log files
Call Detail
Raw Data
CSV
XML
Standard formats schema
We want to create this arrow
8
Half-way there: The PADS System 1.0
[FG PLDI 05, FMW POPL 06, MFWFG POPL 07]
Ad Hoc Data Source
PADS Data Description
PADS Runtime System (I/O, Error Handling)
PADS Compiler
Generated Libraries (Parsing, Printing, Traversal)
XML Converter
Data Profiler
Graphing Tool
Query Engine
Custom App
Generic description-directed programs, coded once
?
XML
Analysis Report
Graph
Information
9
PADS Language Overview
  • Rich base type library
    • integers: Pint8, Puint32, ...
    • strings: Pstring(), Pstring_FW(3), ...
    • systems data: Pdate, Ptime, Pip, ...
  • Type constructors describe complex data sources
    • sequences: Pstruct, Parray, ...
    • choices: Punion, Penum, Pswitch
    • constraints: arbitrary predicates describe
      expected semantic properties
    • parameterization: allows definition of generic
      descriptions

Data formats are described using a specialized
language of types
A formal semantics gives meaning to descriptions
in terms of both external format and internal
data structures generated.
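To make the idea of a "language of types" concrete, here is a rough sketch in Python rather than in PADS syntax. The names `Base`, `Struct`, `Union`, `Constrained`, and `parse` are hypothetical and are not the PADS API; the sketch only illustrates how base types, type constructors, and constraints can compose into a description that doubles as a parser.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical mini description language; NOT the PADS API.

@dataclass
class Base:
    """Base type: a regular expression plus a converter for the matched text."""
    pattern: str
    convert: Callable[[str], Any] = str

    def parse(self, text: str, pos: int) -> Tuple[Any, int]:
        m = re.compile(self.pattern).match(text, pos)
        if m is None:
            raise ValueError(f"expected /{self.pattern}/ at offset {pos}")
        return self.convert(m.group(0)), m.end()

@dataclass
class Struct:
    """Sequence constructor: parse each field in order."""
    fields: List[Any]

    def parse(self, text: str, pos: int) -> Tuple[Any, int]:
        values = []
        for field in self.fields:
            value, pos = field.parse(text, pos)
            values.append(value)
        return values, pos

@dataclass
class Union:
    """Choice constructor: the first branch that parses wins."""
    branches: List[Any]

    def parse(self, text: str, pos: int) -> Tuple[Any, int]:
        for branch in self.branches:
            try:
                return branch.parse(text, pos)
            except ValueError:
                continue
        raise ValueError(f"no branch matched at offset {pos}")

@dataclass
class Constrained:
    """Constraint: attach a semantic predicate to an underlying type."""
    inner: Any
    pred: Callable[[Any], bool]

    def parse(self, text: str, pos: int) -> Tuple[Any, int]:
        value, end = self.inner.parse(text, pos)
        if not self.pred(value):
            raise ValueError(f"constraint failed for {value!r}")
        return value, end

# An illustrative description: a status code, a separator, then int-or-word.
uint = Base(r"\d+", int)
word = Base(r"[A-Za-z]+")
comma = Base(r",\s*")
status = Constrained(uint, lambda n: 100 <= n < 600)
record = Struct([status, comma, Union([uint, word])])

print(record.parse("200, 3082", 0)[0])
print(record.parse("404, missing", 0)[0])
```

Parameterization, error recovery, and the generated printing and traversal libraries mentioned on the previous slide are left out of this sketch.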
10
The Last Mile: The PADS System 2.0
Raw Data
XML
XMLifier
Profiler
Analysis Report
Format Inference Engine
Chunking & Tokenization
Structure Discovery
PADS Data Description
Format Refinement
Scoring Function
PADS Compiler
11
Chunking Process
  • Convert raw input into sequence of chunks.
  • Supported divisions:
    • Various forms of newline
    • File boundaries
  • Also possible: user-defined paragraphs

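The chunking step can be sketched as follows. This is a minimal illustration, not the PADS implementation: the function names are made up, and treating a "paragraph" as a run of lines separated by blank lines is an assumption standing in for whatever user-defined rule is configured.

```python
import re
from typing import Iterable, List

def normalize(raw: str) -> str:
    """Map the common newline forms (\\r\\n, \\r, \\n) onto a single \\n."""
    return raw.replace("\r\n", "\n").replace("\r", "\n")

def chunk_lines(raw: str) -> List[str]:
    """One chunk per non-empty line."""
    return [line for line in normalize(raw).split("\n") if line]

def chunk_files(files: Iterable[str]) -> List[str]:
    """One chunk per file: each file's content is a single chunk."""
    return list(files)

def chunk_paragraphs(raw: str) -> List[str]:
    """Assumed paragraph rule: chunks are separated by blank lines."""
    parts = re.split(r"\n\s*\n", normalize(raw))
    return [p.strip("\n") for p in parts if p.strip()]

print(chunk_lines("a\r\nb\nc\rd"))
print(chunk_paragraphs("x 1\nx 2\n\ny 1\n"))
```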
12
Tokenization
  • Tokens/base types expressed as regular
    expressions.
  • Basic tokens
    • Integer, white space, punctuation, strings
  • Distinctive tokens
    • IP addresses, dates, times, MAC addresses, ...

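A regular-expression tokenizer in this spirit might look like the sketch below. The token names and patterns are assumptions chosen for illustration, not the token set the system ships with. Distinctive tokens are listed before basic ones so that they take priority when both could match at the same position.

```python
import re
from typing import List, Tuple

# Assumed token definitions; order encodes priority (distinctive first).
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("Date",   r"\d{1,2}/[A-Za-z]{3}/\d{4}"),
    ("Time",   r"\d{2}:\d{2}:\d{2}"),
    ("Int",    r"-?\d+"),
    ("White",  r"[ \t]+"),
    ("String", r"[A-Za-z][A-Za-z0-9_]*"),
    ("Punct",  r"[^\sA-Za-z0-9]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(chunk: str) -> List[Tuple[str, str]]:
    """Split one chunk into (token kind, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(chunk):
        m = LEXER.match(chunk, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {chunk[pos]!r}")
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens

for kind, text in tokenize('207.136.97.49 - - "GET /tk/p.txt HTTP/1.0" 200 30'):
    print(kind, repr(text))
```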
13
Histograms
14
Clustering
Group clusters with similar frequency
distributions
Cluster 1
Cluster 2
Cluster 3
Two frequency distributions are similar if they
have the same shape (within some error tolerance)
when the columns are sorted by height.
Rank clusters by a metric that rewards high
coverage and narrow distributions. Choose the
cluster with the highest score.
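The histogram and clustering steps can be sketched as below. The slides do not give the exact similarity test or scoring metric, so both are assumptions here: two shapes count as similar if their sorted column heights differ by at most `tol`, and a token's score is its coverage divided by the width of its distribution. A cluster is scored by its weakest token.

```python
from collections import Counter
from typing import List

def histogram(chunks: List[List[str]], token: str) -> Counter:
    """Map occurrence count k to the number of chunks containing token k times."""
    return Counter(chunk.count(token) for chunk in chunks)

def shape(hist: Counter, n_chunks: int) -> List[float]:
    """Column heights as fractions of all chunks, sorted tallest first."""
    return sorted((h / n_chunks for h in hist.values()), reverse=True)

def similar(a: List[float], b: List[float], tol: float = 0.1) -> bool:
    """Assumed test: zero-padded shapes differ by at most tol in every column."""
    width = max(len(a), len(b))
    a = a + [0.0] * (width - len(a))
    b = b + [0.0] * (width - len(b))
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def score(hist: Counter, n_chunks: int) -> float:
    """Assumed metric: high coverage and a narrow distribution score well."""
    coverage = 1 - hist[0] / n_chunks            # chunks containing the token
    width = sum(1 for k in hist if k != 0) or 1  # distinct non-zero counts
    return coverage / width

def best_cluster(chunks: List[List[str]], tol: float = 0.1) -> List[str]:
    """Greedily cluster tokens by shape, then return the top-scoring cluster."""
    n = len(chunks)
    tokens = sorted({t for chunk in chunks for t in chunk})
    hists = {t: histogram(chunks, t) for t in tokens}
    clusters: List[List[str]] = []
    for t in tokens:
        for cluster in clusters:
            # The first token of a cluster serves as its representative.
            if similar(shape(hists[t], n), shape(hists[cluster[0]], n), tol):
                cluster.append(t)
                break
        else:
            clusters.append([t])
    return max(clusters, key=lambda c: min(score(hists[t], n) for t in c))

chunks = [
    ["Int", "Comma", "White", "Int"],
    ["Int", "Comma", "White", "String"],
    ["Int", "Comma", "White", "String"],
    ["Int", "Comma", "White", "Int"],
]
print(best_cluster(chunks))
```

In this toy input, Comma and White appear exactly once in every chunk, so they form the narrowest, highest-coverage cluster.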
15
Partition chunks
In our example, all the tokens appear in the same
order in all chunks, so the union is degenerate.
16
Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
17
Then Recurse...
18
Inferred type
19
Structure Discovery Review
  • Compute frequency distribution for each token.
  • Cluster tokens with similar frequency
    distributions.
  • Create hypothesis about data structure from
    cluster distributions:
    • Struct
    • Array
    • Union
    • Basic type (bottom out)
  • Partition data according to hypothesis and recurse.

123, 24
345, begin
574, end
9378, 56
12, middle
-12, problem
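The recursion can be illustrated on the sample records from this slide with a heavily simplified sketch. It is not the PADS algorithm: it skips the histogram clustering, treats any token that occurs exactly once in every chunk as a struct anchor, forms unions by partitioning on whole token sequences, and has no array hypothesis. The tokenizer and all function names are assumptions.

```python
import re
from typing import List

# Assumed minimal tokenizer for the sample records.
TOKEN = re.compile(r"(?P<Int>-?\d+)|(?P<String>[A-Za-z]+)|(?P<Comma>,)|(?P<White>\s+)")

def kinds(chunk: str) -> List[str]:
    """Token kinds of one chunk, in order of appearance."""
    return [m.lastgroup for m in TOKEN.finditer(chunk)]

def describe(fields: List[str]) -> str:
    """Render a field list as Empty, a single type, or a Struct."""
    if not fields:
        return "Empty"
    if len(fields) == 1:
        return fields[0]
    return "Struct(" + ", ".join(fields) + ")"

def fields(chunks: List[List[str]]) -> List[str]:
    """Infer the field sequence shared by all chunks."""
    distinct = sorted({tuple(c) for c in chunks})
    if len(distinct) == 1:
        return list(distinct[0])  # bottom out: every chunk looks the same
    # Struct hypothesis: a token that occurs exactly once in every chunk.
    anchor = next(
        (t for t in chunks[0] if all(c.count(t) == 1 for c in chunks)), None
    )
    if anchor is not None:
        left = fields([c[: c.index(anchor)] for c in chunks])
        right = fields([c[c.index(anchor) + 1 :] for c in chunks])
        return left + [anchor] + right
    # Union hypothesis: one branch per distinct token sequence.
    return ["Union(" + ", ".join(describe(list(d)) for d in distinct) + ")"]

def infer(records: List[str]) -> str:
    return describe(fields([kinds(r) for r in records]))

records = ["123, 24", "345, begin", "574, end", "9378, 56", "12, middle", "-12, problem"]
print(infer(records))
```

For these six records the sketch yields a struct of an integer, a comma, white space, and a union of integer and string, which matches the shape of hypothesis the review describes.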
20
Testing and Evaluation
  • Evaluated overall results qualitatively
    • Compared with Excel: a manual process with
      limited facilities for representation of
      hierarchy or variation
    • Compared with hand-written descriptions:
      performance variable depending on tokenization
      choices and complexity
  • Evaluated accuracy quantitatively
    • For many formats, 95% accuracy from 5% of
      available data
  • Evaluated performance quantitatively
    • Hours to days to hand-write formats
    • After fixing the format, appears to scale
      linearly with data size
    • < 1 min on 300K data

21
Technical Summary www.padsproj.org
  • PADS 1.0 is an effective implementation framework
    for many data processing tasks
  • PADS 2.0 improves programmer productivity further
    by automatically inferring formats and generating
    many tools and libraries

Email
struct ........ ...... ...........
ASCII log files
Binary Traces
CSV
XML
22
End
23
Execution Time
Data source               SD (s)  Ref (s)  Tot (s)  HW (h)
1967Transactions.short      0.20     2.32     2.56     4.0
MER_T01_01.csv              0.11     2.82     2.92     0.5
Ai.3000                     1.97    26.35    28.64     1.0
Asl.log                     2.90    52.07    55.26     1.0
Boot.log                    0.11     2.40     2.53     1.0
Crashreporter.log           0.12     3.58     3.73     2.0
Crashreporter.log.mod       0.15     3.83     4.00     2.0
Sirius.1000                 2.24     5.69     8.00     1.5
Ls-l.txt                    0.01     0.10     0.11     1.0
Netstat-an                  0.07     0.74     0.82     1.0
Page_log                    0.08     0.55     0.65     0.5
quarterlypersonalincome     0.07     5.11     5.18      48
Railroad.txt                0.06     2.69     2.76     2.0
Scrollkeeper.log            0.13     3.24     3.40     1.0
Windowserver_last.log       0.37     9.65    10.07     1.5
Yum.txt                     0.11     1.91     2.03     5.0
SD: structure discovery; Ref: refinement;
Tot: total; HW: hand-written
24
Training Time
25
Minimum Necessary Training Sizes
Data source               90%  95%
Sirius.1000                 5   10
1967Transactions.short      5    5
Ai.3000                     5   10
Asl.log                     5   10
Scrollkeeper.log            5    5
Page_log                    5    5
MER_T01_01.csv              5    5
Crashreporter.log          10   15
Crashreporter.log.mod       5   15
Windowserver_last.log       5   15
Netstat-an                 25   35
Yum.txt                    30   45
quarterlypersonalincome    10   10
Boot.log                   45   60
Ls-l.txt                   50   65
Railroad.txt               60   75
26
Problem: Tokenization
  • Technical problem
    • Different data sources assume different
      tokenization strategies
    • Useful token definitions sometimes overlap, can
      be ambiguous, and aren't always easily expressed
      using regular expressions
    • Matching the tokenization of the underlying data
      source can make a big difference in structure
      discovery
  • Current solution
    • Parameterize learning system with customizable
      configuration files
    • Automatically generate lexer file and basic token
      types
  • Future solutions
    • Use existing PADS descriptions and data sources
      to learn probabilistic tokenizers
    • Incorporate probabilities into sophisticated
      back-end rewriting system
    • Back end has more context for making final
      decisions than the tokenizer, which reads 1
      character at a time without lookahead
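A small sketch of why tokenization choices matter: the same text yields different token sequences under different token configurations, and structure discovery then sees different data. The two configurations and the `make_tokenizer` helper below are hypothetical stand-ins for the customizable configuration files, not their actual format.

```python
import re
from typing import Callable, List, Tuple

Spec = List[Tuple[str, str]]  # ordered (token name, regex) pairs

def make_tokenizer(specs: Spec) -> Callable[[str], List[str]]:
    """Build a tokenizer from an ordered token configuration."""
    lexer = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in specs))

    def tokenize(text: str) -> List[str]:
        # finditer silently skips characters that no token matches.
        return [m.lastgroup for m in lexer.finditer(text)]

    return tokenize

# Two assumed configurations: basic tokens only, and basic tokens plus IP.
BASIC: Spec = [
    ("Int", r"\d+"),
    ("Punct", r"[^\w\s]"),
    ("White", r"\s+"),
    ("String", r"[A-Za-z]\w*"),
]
WITH_IP: Spec = [("IP", r"\d{1,3}(?:\.\d{1,3}){3}")] + BASIC

print(make_tokenizer(BASIC)("10.0.0.1"))
print(make_tokenizer(WITH_IP)("10.0.0.1"))
```

Under the first configuration the address becomes seven tokens; under the second it is one. If the text were a four-part version number instead, the second reading would be the misleading one, which is the kind of ambiguity a back end with more context could resolve.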