Statistical Magic: Progress in Automatic Tool Generation for Ad Hoc Data
1
Statistical Magic: Progress in Automatic Tool
Generation for Ad Hoc Data
  • Qian Xi
  • 2008/5/13

Joint work with Professor David Walker, Kathleen
Fisher (AT&T), Kenny Zhu
2
Ad Hoc Data
  • Standardized data formats: HTML, XML
  • Their data processing tools: visualizers (HTML
    browsers), XQuery
  • Ad hoc data: non-standard, semi-structured
  • Not many data processing tools
  • Examples: web server logs (CLF), phone call
    provisioning data, train schedules, stock
    trading info

Table 1-9: ADA-Accessible Rail Transit Stations by Agency,,,,,,,,,,,,,,,
Type of rail transit / agency,Primary city served,Number of stations,,,,,,,Number of ADA-accessible stations,,,,,,
,,1996,1997,1998,1999,2000,2001,2002,1996,1997,1998,1999,2000,2001,2002
Heavy rail,,,,,,,,,,,,,,,
Bay Area Rapid Transit,"San Francisco, CA",36,39,39,39,39,39,39,36,39,39,39,39,39,39
Los Angeles County Metropolitan Transportation Authority,"Los Angeles, CA",5,8,8,13,16,16,16,5,8,8,13,16,16,16
91522729152272128136400922813640092281364009
22813640092no_ii152272EDTF_60MARVINS1UNO10
1000295291 915227291522721281364009228136400
9228136400922813640092no_ii15222EDTF_60MARV
INS1UNO101000295291201000295291171001649600
191001
YHOO YAHOO INC 23.93 10:15AM ET 4.74 (16.53) 103,880,260
MSFT MICROSOFT CP 29.93 10:15AM ET 0.69 (2.36) 39,165,715
SPY S&P DEP RECEIPTS 141.52 10:10AM ET 0.01 (0.01) 20,723,717
QQQQ POWERSHARES QQQ TR 148.88 10:15AM ET 0.11 (0.23) 22,074,278
CSCO CISCO SYS INC 26.68 10:15AM ET 0.07 (0.26) 13,934,552
CFC COUNTRYWIDE FNL CP 5.36 10:10AM ET 0.62 (10.37) 13,019,603
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
1/33
3
Analytical Tasks
  • Format converters: XML converter
  • Statistical analyzers: which pages on the
    website are visited most frequently?
  • Visualizers

2/33
graph from http://www.data360.org
4
learnPADS Goal
  • Automatically generates a description of the
    format
  • Automatically generates a suite of data
    processing tools

[Figure: sample data "0,24 bar,end foo,16" fed into the generated tools: XML converter, Grapher, ...]
3/33
5
learnPADS Architecture
[Architecture diagram: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement (together, the Format Inference Engine) → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
4/33
Slide credit: Kenny Zhu, POPL'08 talk
6
learnPADS Architecture
[Architecture diagram, zoomed in on Chunking & Tokenization:
  chunks: "0, 24"  "bar, end"  "foo, 16"
  token sequences:
    Quote Int Comma Int Quote
    Quote String Comma String Quote
    Quote String Comma Int Quote
  generalized by Structure Discovery to: Quote (String | Int) Comma (String | Int) Quote]
5/33
7
Motivation: the Token Ambiguity Problem (TAP)
  • Given a string, there are multiple ways to
    tokenize it.
  • Example 1: 127.0.0.1
    • IP
    • Float Dot Float
    • Int Dot Int Dot Int Dot Int
  • Example 2:
    • Message
    • Word White Word White Word White ... White URL
    • Word White Quote Filepath Quote White Word
      White ...
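The ambiguity is easy to reproduce. The sketch below (Python; a hypothetical token set, not the actual PADS base tokens) enumerates every way to tokenize 127.0.0.1 under overlapping regular-expression definitions:

```python
import re

# Hypothetical token definitions -- the real PADS base-token set is richer.
TOKENS = [
    ("IP",    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    ("Float", r"\d+\.\d+"),
    ("Int",   r"\d+"),
    ("Dot",   r"\."),
]

def tokenizations(s, pos=0):
    """Yield every token sequence that parses s completely."""
    if pos == len(s):
        yield []
        return
    for name, pat in TOKENS:
        m = re.match(pat, s[pos:])
        if m and m.end() > 0:
            for rest in tokenizations(s, pos + m.end()):
                yield [name] + rest

seqs = list(tokenizations("127.0.0.1"))
print(len(seqs))
```

All three readings from the slide (IP; Float Dot Float; Int Dot Int Dot Int Dot Int) appear among the enumerated candidates, together with further mixed ones.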

6/33
8
How Does learnPADS Deal with TAP?
  • Tokenization phase: take the first, longest match.
  • A fixed token priority order (e.g. Float, Int,
    ID, Path) is assigned by the end user.
  • We have no such order to pick automatically.

As a result, the current learning system can't
have ambiguous base tokens (Message, Text,
ID) and sometimes produces descriptions that are too
precise.
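A minimal sketch of that rule (Python; hypothetical tokens and priority order, not the actual learnPADS lexer): the longest match wins, and the fixed priority order breaks ties.

```python
import re

# Hypothetical priority-ordered token definitions (earlier = higher priority).
TOKENS = [
    ("IP",    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    ("Float", r"\d+\.\d+"),
    ("Int",   r"\d+"),
    ("Dot",   r"\."),
]

def lex(s):
    """Tokenize by taking, at each position, the first, longest match."""
    out, pos = [], 0
    while pos < len(s):
        best = None
        for name, pat in TOKENS:
            m = re.match(pat, s[pos:])
            # Strictly longer matches win; on ties the earlier token keeps priority.
            if m and m.end() > 0 and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        out.append(best[0])
        pos += best[1]
    return out

print(lex("127.0.0.1"))  # ['IP']
print(lex("0.1"))        # ['Float']
```

On 0.1 the lexer commits to Float, illustrating how a fixed order forecloses the other readings.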
7/33
9
A Concrete Example: Tokenization by Lex

Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed, (ipc/send) invalid destination port

date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white string⟨crashreporterd⟩ char⟨[⟩ int⟨120⟩ char⟨]⟩ char⟨:⟩ white string⟨mach_msg⟩ char⟨(⟩ char⟨)⟩ white string⟨reply⟩ white string⟨failed⟩ char⟨,⟩ white char⟨(⟩ string⟨ipc⟩ char⟨/⟩ string⟨send⟩ char⟨)⟩ white string⟨invalid⟩ white string⟨destination⟩ white string⟨port⟩
8/33
10
Inspiration
  • Humans distinguish tokens using background
    knowledge:
  • Purl usually starts with http://
  • Pdate: "March" with a time like 12:30:55 nearby
  • Ptext and Pmessage are long, sometimes running
    to the end of a line
  • This knowledge can be encoded in statistical
    models.
  • Statistical models are very successful in
    Natural Language Processing and Speech Recognition.

9/33
11
learnPADS Architecture Recall
[Architecture diagram: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
10/33
12
Tokenization Problem Specification
  • Inputs:
    • A set of tokens with regular-expression
      definitions
    • A collection of strings annotated with token
      sequences (a tool labels the chunks
      automatically, given a description)
    • A test string
  • Output:
    • A valid, best token sequence for the test string
  • A supervised learning problem

11/33
13
Token sequence representation Seqset
  • A valid token sequence for a given string:
    • can parse the string
    • obeys the longest-match rule.
  • An example: "0.1"
    • valid: Int Dot Int
    • invalid: Float Dot Int
  • Seqset: the directed acyclic graph (DAG)
    representing all possible token sequences of a
    chunk.
    • Vertices: characters / positions
    • Edges: tokens, running from the token's start
      position to its end position
  • A string of 26 characters can have 14,521,680
    token sequences.
[Seqset DAG for "0.1": vertices S, "0", ".", "1"; edges Int over "0", Dot over ".", Int over "1", and Float spanning "0.1"]
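A seqset can be sketched as follows (Python; illustrative tokens): vertices are string positions, edges are token matches, and a path count over the DAG gives the number of token sequences without enumerating them.

```python
import re

# Illustrative token set; vertices are positions 0..len(s), edges are matches.
TOKENS = [("Float", r"\d+\.\d+"), ("Int", r"\d+"), ("Dot", r"\.")]

def build_seqset(s):
    """edges[i] lists (token, j) meaning the token matches s[i:j]."""
    edges = {i: [] for i in range(len(s))}
    for i in range(len(s)):
        for name, pat in TOKENS:
            m = re.match(pat, s[i:])
            if m and m.end() > 0:
                edges[i].append((name, i + m.end()))
    return edges

def count_sequences(edges, n):
    """Count paths from position 0 to n by dynamic programming."""
    ways = [0] * (n + 1)
    ways[n] = 1
    for i in range(n - 1, -1, -1):
        ways[i] = sum(ways[j] for _, j in edges[i])
    return ways[0]

g = build_seqset("0.1")
print(count_sequences(g, 3))  # 2 paths: Int Dot Int, and Float spanning "0.1"
```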
12/33
14
2-step Tokenization Algorithm
  • Given token definitions of regular expressions,
    construct the Seqsets.
  • For each record, find the most likely token
    sequence using either:
    • a Hidden Markov Model, or
    • a Hierarchical Maximum-Entropy Model

13/33
15
Hidden Markov Model (HMM)
  • Observation: ci
    • character vs. character feature vector
    • character features: upper/lower case, digit,
      punctuation...
  • Hidden state: si, a token (or partial token)

[HMM trellis: hidden states Quote Word Word Word Comma Int Int Quote emitting the characters " f o o , 1 6 "]
14/33
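A character-level Viterbi decode in this spirit (Python; toy states and hand-set probabilities standing in for the learned parameters and character feature vectors):

```python
import math

STATES = ["Word", "Int", "Comma"]
INIT = {s: 1 / 3 for s in STATES}
TRANS = {p: {s: 1 / 3 for s in STATES} for p in STATES}  # uniform, for simplicity

def emit(state, ch):
    """Hand-set emissions from simple character features (toy values)."""
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    if state == "Comma":
        return 0.9 if ch == "," else 0.05
    return 0.8 if ch.isalpha() else 0.1  # Word

def viterbi(obs):
    """Most likely hidden-state sequence for the observed characters."""
    layer = {s: (math.log(INIT[s] * emit(s, obs[0])), [s]) for s in STATES}
    for ch in obs[1:]:
        nxt = {}
        for s in STATES:
            lp, path = max(
                (layer[p][0] + math.log(TRANS[p][s]), layer[p][1])
                for p in STATES
            )
            nxt[s] = (lp + math.log(emit(s, ch)), path + [s])
        layer = nxt
    return max(layer.values())[1]

print(viterbi("foo,16"))  # ['Word', 'Word', 'Word', 'Comma', 'Int', 'Int']
```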
16
Hidden Markov Model Parameters
15/33
17
Test Data Sources
16/33
18
HMM Discussion
Error rate: percentage of tokens not identified,
w.r.t. the labeled token sequences
lex: fixed token priority; take the first,
longest match
17/33
19
More HMM Tokenization Results
18/33
20
Hierarchical Maximum-Entropy Model
[Hierarchical Max-Ent diagram: top-level tokens Quote Word Comma Int Quote over the characters " foo , 16 ", each token generating its span of characters]
19/33
21
Hierarchical Max-Ent Model Discussion
Average log probability = log likelihood /
length of the token sequence

s = "qxi qxi@cs.princeton.edu 1.63"
P(ID | qxi) = 0.8    P(White | " ") = 1.0
P(Email | qxi@cs.princeton.edu) = 0.9    P(Float | 1.63) = 0.9
P(White | Others) = 0.7    P(Others | White) = 0.8

normal:
P(ID White Email White Float | qxi qxi@cs.princeton.edu 1.63)
  = 0.8 × 1.0 × 0.9 × 1.0 × 0.9 × 0.7 × 0.8 × 0.7 × 0.8 ≈ 0.203
P(Blob | qxi qxi@cs.princeton.edu 1.63) = 0.3

average:
log P(ID White Email White Float | qxi qxi@cs.princeton.edu 1.63)
  = [(-0.097) + (-0.046) + (-0.046) + (-0.155) + (-0.097)
     + (-0.155) + (-0.097)] / 5 ≈ -0.139
log P(Blob | qxi qxi@cs.princeton.edu 1.63) = -0.523
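The slide's arithmetic can be replayed directly (base-10 logs; the probabilities are the ones given above):

```python
import math

# Emission/transition probabilities exactly as given on the slide.
factors = [0.8, 1.0, 0.9, 1.0, 0.9, 0.7, 0.8, 0.7, 0.8]
seq_len = 5  # ID White Email White Float

normal = math.prod(factors)                              # ~0.203
avg_log = sum(math.log10(f) for f in factors) / seq_len  # ~-0.138

blob_normal = 0.3
blob_avg_log = math.log10(0.3)                           # ~-0.523

# Plain probabilities favor the one-token Blob parse;
# per-token average log probabilities favor the structured parse.
print(normal < blob_normal, avg_log > blob_avg_log)  # True True
```

(The slide reports -0.139 because it averages the already-rounded per-factor logs.)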
20/33
22
Hierarchical Max-Ent Model Discussion
Error rate: percentage of tokens not identified,
w.r.t. the labeled token sequences
  lex
  normal emission probabilities
  average log emission probabilities
21/33
23
Hierarchical Max-Ent Model Results
22/33
24
Lex vs. HMM vs. Hierarchical Max-Ent
  • Ambiguity is increased.
  • The training corpus is not large enough.

23/33
25
learnPADS Architecture Recall
[Architecture diagram: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
24/33
26
Structure Discovery Phase
Example records: "0,24"  "bar,end"  "foo,16"
Candidate structures: Struct, Union, Array
Token sequences:
  Quote Int Comma Int Quote
  Quote String Comma String Quote
  Quote String Comma Int Quote
Discovered description:
  Struct { Quote; Union { Int | String }; Comma; Union { Int | String }; Quote }
Struct: classify chunks by token counts {(Quote, 2), (Comma, 1)}.
25/33
27
Extended Viterbi Algorithm
[Trellis diagram: DP states are (position, token counts) pairs, from (pos 0, {(Quote, 0), (Comma, 0)}) to (pos n, {(Quote, 2), (Comma, 1)}); edges are tokens (Msg, Txt, Int, Float, Quote, ...) weighted by probabilities P_(i,j)_Token, and only paths reaching the required final counts survive.]
26/33
28
Evaluation 1 Qualitative Judge by Human
27/33
29
Evaluation 2 Complexity Scores
  • Minimum Description Length (MDL) Principle
  • Cost (in bits) of transmitting the data =
    • cost (in bits) to transmit the description (CT)
    • + cost (in bits) to transmit the data given
      the description (CD)

28/33
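A toy comparison (all bit costs hypothetical) showing how MDL trades description cost against data cost:

```python
# Hypothetical bit costs for two candidate descriptions of the same data:
# MDL minimizes CT + CD.
candidates = {
    # catch-all Blob description: tiny CT, but the data stays expensive
    "Blob":   {"CT": 8,   "CD": 2000},
    # precise per-record struct: larger CT, much smaller CD
    "Struct": {"CT": 400, "CD": 300},
}

def mdl(c):
    return c["CT"] + c["CD"]

best = min(candidates, key=lambda name: mdl(candidates[name]))
print(best)  # Struct
```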
30
Evaluation 3 Execution Time
29/33
31
Evaluation 4 Success Rates
30/33
32
Related Work
  • Grammar induction: structure discovery without
    the token ambiguity problem
    • Arasu & Garcia-Molina '03: extracting structure
      from web pages
    • Garofalakis et al. '00: XTRACT for inferring
      DTDs
    • Kushmerick et al. '97: wrapper induction
  • Detecting table row components by Hidden Markov
    Models / Conditional Random Fields
    • Pinto et al. '03
  • Extracting certain fields in records from text
    • Borkar et al. '01
  • Predicting exons and introns in DNA sequences
    using generalized HMMs
    • Kulp '96
  • Part-of-speech tagging in natural language
    processing
    • Heeman '99 (Decision Tree)
  • Speech recognition
    • Rabiner '89

31/33
33
Future Work
  • Statistical model accuracy
    • HMM parameter re-estimation by the Baum-Welch
      algorithm
    • Hierarchical Max-Ent Model: a token-generating
      model P(S | T)
  • How to make use of vertical information
    • one record is not independent of the others
    • not suitable for large data sets
    • key alignment
    • Conditional Random Fields
  • Online learning
    • old description + old data + new data →
      new description

32/33
34
Contributions
  • Resolve the Token Ambiguity Problem with
    statistical approaches
    • use all possible token sequences.
  • Integrate two statistical approaches into the
    learnPADS framework:
    • Hidden Markov Model
    • Hierarchical Maximum-Entropy Model
  • Improve chunk partitioning in structure discovery
    with the help of Seqsets.
  • Evaluate correctness and performance by a number
    of measures.
  • Results show that multiple token sequences
    and statistical methods achieve partial success.

33/33
35
End
36
Extended Viterbi Algorithm with Token Counts
Input: a hidden Markov model H, a string
C = c1...cn, token counts tc = {(s1, f1), ..., (sl, fl)}
Output: the best token sequence that satisfies tc.
Define [recurrence omitted in the transcript],
where si occurs oi times up to position k.
1/9
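A sketch of the algorithm (Python; the toy character-level model from the HMM example, with hand-set probabilities): the DP state is (token, counts-so-far), transitions that would exceed a required count are pruned, and only paths ending with exactly the required counts are considered.

```python
import math

STATES = ["Word", "Int", "Comma"]
INIT = {s: 1 / 3 for s in STATES}
TRANS = {p: {s: 1 / 3 for s in STATES} for p in STATES}

def emit(state, ch):
    """Toy emissions from simple character features."""
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    if state == "Comma":
        return 0.9 if ch == "," else 0.05
    return 0.8 if ch.isalpha() else 0.1  # Word

def extended_viterbi(obs, tc):
    """Best state sequence whose counts of the tracked tokens equal tc."""
    tracked = sorted(tc)

    def bump(counts, s):
        # Advance the count vector for state s; None if it would exceed tc.
        if s not in tc:
            return counts
        i = tracked.index(s)
        c = list(counts)
        c[i] += 1
        return None if c[i] > tc[s] else tuple(c)

    layer = {}
    for s in STATES:
        c = bump(tuple(0 for _ in tracked), s)
        if c is not None:
            layer[(s, c)] = (math.log(INIT[s] * emit(s, obs[0])), [s])
    for ch in obs[1:]:
        nxt = {}
        for (p, counts), (lp, path) in layer.items():
            for s in STATES:
                c = bump(counts, s)
                if c is None:
                    continue  # pruned: a required count would be exceeded
                score = lp + math.log(TRANS[p][s] * emit(s, ch))
                if (s, c) not in nxt or score > nxt[(s, c)][0]:
                    nxt[(s, c)] = (score, path + [s])
        layer = nxt
    goal = tuple(tc[t] for t in tracked)
    finals = [v for (s, c), v in layer.items() if c == goal]
    return max(finals)[1] if finals else None

print(extended_viterbi("a,1", {"Comma": 1}))  # ['Word', 'Comma', 'Int']
print(extended_viterbi("a,1", {"Comma": 0}))  # ['Word', 'Word', 'Int']
```

Note how the count constraint changes the decoded sequence for the same input string.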
37
Proof
2/9
38
Reduce Execution Time
  • Parallel computing
    • seqset construction
    • most-likely-token-sequence search
    • embarrassingly parallel
  • Learn the description from a portion of the data
    • How much data is needed to learn a good
      description?

3/9
39
Better Training Set ?
  lex
  HMM results, using 19/20 data sources as training data
  HMM results, using 19/20 data sources plus 5% of the
  test data source as training data
4/9
40
Find Common Initial Tokens
union: classify chunks into different branches by
the first token of each chunk

Given S = ∪_{i=1..n} {chunk_i}, T = ∪_{i=1..k} {token_i}, and
init(chunk_i) = {token_i1, ..., token_iz} ⊆ T, find the smallest
subset I of T s.t. for all i = 1 to n, init(chunk_i) ∩ I ≠ ∅.

Set Cover Problem: Given S(token_i) = {chunk_j |
token_i ∈ init(chunk_j)}, select a minimum number
of these sets so that the sets you have picked
contain all the elements that are contained in
any of the sets.

NP-complete → approximation: greedy algorithm
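The greedy approximation can be sketched as follows (Python; hypothetical chunk data): repeatedly pick the token that covers the most still-uncovered chunks.

```python
def greedy_cover(chunk_inits):
    """chunk_inits[i] is the set of candidate initial tokens of chunk i.
    Returns a small token set hitting every chunk's init set
    (assumes every init set is non-empty)."""
    uncovered = set(range(len(chunk_inits)))
    universe = sorted({t for s in chunk_inits for t in s})
    chosen = []
    while uncovered:
        # Pick the token covering the most still-uncovered chunks
        # (ties broken alphabetically via the sorted universe).
        token = max(universe,
                    key=lambda t: sum(1 for i in uncovered if t in chunk_inits[i]))
        chosen.append(token)
        uncovered -= {i for i in uncovered if token in chunk_inits[i]}
    return chosen

print(greedy_cover([{"Int", "Quote"}, {"Quote"}, {"Word"}]))  # ['Quote', 'Word']
```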
5/9
41
Evaluation 1 Qualitative Judge by Human
6/9
42
Evaluation 2 Complexity Score
7/9
43
A Concrete Example: Tokenization by HMM

Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed, (ipc/send) invalid destination port

Lex:
date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white string⟨crashreporterd⟩ char⟨[⟩ int⟨120⟩ char⟨]⟩ char⟨:⟩ white string⟨mach_msg⟩ char⟨(⟩ char⟨)⟩ white string⟨reply⟩ white string⟨failed⟩ char⟨,⟩ white char⟨(⟩ string⟨ipc⟩ char⟨/⟩ string⟨send⟩ char⟨)⟩ white string⟨invalid⟩ white string⟨destination⟩ white string⟨port⟩

HMM:
word⟨Sat⟩ white word⟨Jun⟩ white int⟨24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white word⟨crashreporterd⟩ punctuation⟨[⟩ int⟨120⟩ punctuation⟨]⟩ punctuation⟨:⟩ message⟨mach_msg() reply failed⟩ punctuation⟨,⟩ message⟨(ipc/send) invalid destination port⟩
8/9
44
A Concrete Example: Tokenization by Max-Ent

Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed, (ipc/send) invalid destination port

Lex:
date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white string⟨crashreporterd⟩ char⟨[⟩ int⟨120⟩ char⟨]⟩ char⟨:⟩ white string⟨mach_msg⟩ char⟨(⟩ char⟨)⟩ white string⟨reply⟩ white string⟨failed⟩ char⟨,⟩ white char⟨(⟩ string⟨ipc⟩ char⟨/⟩ string⟨send⟩ char⟨)⟩ white string⟨invalid⟩ white string⟨destination⟩ white string⟨port⟩

HMM:
word⟨Sat⟩ white word⟨Jun⟩ white int⟨24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white word⟨crashreporterd⟩ punctuation⟨[⟩ int⟨120⟩ punctuation⟨]⟩ punctuation⟨:⟩ message⟨mach_msg() reply failed⟩ punctuation⟨,⟩ message⟨(ipc/send) invalid destination port⟩

Max-Ent:
date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white word⟨crashreporterd⟩ punctuation⟨[⟩ int⟨120⟩ punctuation⟨]⟩ punctuation⟨:⟩ message⟨mach_msg() reply failed⟩ punctuation⟨,⟩ message⟨(ipc/send) invalid destination port⟩
9/9