Statistical Magic: Progress in Automatic Tool Generation for Ad Hoc Data
1
Statistical Magic: Progress in Automatic Tool
Generation for Ad Hoc Data
  • Qian Xi
  • 2008/5/13

Joint work with Professor David Walker, Kathleen
Fisher (AT&T), Kenny Zhu
2
Ad Hoc Data
  • Standardized data formats: HTML, XML
  • Their data processing tools: visualizers (HTML
    browsers), XQuery
  • Ad hoc data: non-standard, semi-structured
  • Not many data processing tools
  • Examples: web server logs (CLF), phone call
    provisioning data, train schedules, stock
    trading info

Table 1-9: ADA-Accessible Rail Transit Stations by Agency,,,,,,,,,,,,,,,
Type of rail transit / agency,Primary city served,Number of stations,,,,,,,Number of ADA-accessible stations,,,,,,
,,1996,1997,1998,1999,2000,2001,2002,1996,1997,1998,1999,2000,2001,2002
Heavy rail,,,,,,,,,,,,,,,
Bay Area Rapid Transit,"San Francisco, CA",36,39,39,39,39,39,39,36,39,39,39,39,39,39
Los Angeles County Metropolitan Transportation Authority,"Los Angeles, CA",5,8,8,13,16,16,16,5,8,8,13,16,16,16
91522729152272128136400922813640092281364009
22813640092no_ii152272EDTF_60MARVINS1UNO10
1000295291 915227291522721281364009228136400
9228136400922813640092no_ii15222EDTF_60MARV
INS1UNO101000295291201000295291171001649600
191001
YHOO YAHOO INC 23.93 10:15AM ET 4.74 (16.53) 103,880,260
MSFT MICROSOFT CP 29.93 10:15AM ET 0.69 (2.36) 39,165,715
SPY S&P DEP RECEIPTS 141.52 10:10AM ET 0.01 (0.01) 20,723,717
QQQQ POWERSHARES QQQ TR 148.88 10:15AM ET 0.11 (0.23) 22,074,278
CSCO CISCO SYS INC 26.68 10:15AM ET 0.07 (0.26) 13,934,552
CFC COUNTRYWIDE FNL CP 5.36 10:10AM ET 0.62 (10.37) 13,019,603
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
1/33
3
Analytical Tasks
  • Format converters: XML converter
  • Statistical analyzers: which pages on the
    website are visited most frequently?
  • Visualizers

2/33
graph from http://www.data360.org
4
learnPADS Goal
  • Automatically generates a description of the
    format
  • Automatically generates a suite of data
    processing tools

[Figure: sample data "0,24 bar,end foo,16" fed into the generated tools: XML converter, Grapher, ...]
3/33
5
learnPADS Architecture
[Architecture diagram: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement (together, the Format Inference Engine) → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
4/33
Slide credit: Kenny Zhu, POPL'08 talk
6
learnPADS Architecture
[Architecture diagram, zoomed in on Chunking & Tokenization:
  chunks: "0, 24"  "bar, end"  "foo, 16"
  token sequences:
    Quote Int Comma Int Quote
    Quote String Comma String Quote
    Quote String Comma Int Quote
  generalized by Structure Discovery to: Quote (String | Int) Comma (String | Int) Quote]
5/33
7
Motivation: the Token Ambiguity Problem (TAP)
  • Given a string, there are multiple ways to
    tokenize it.
  • Example 1: 127.0.0.1
    • IP
    • Float Dot Float
    • Int Dot Int Dot Int Dot Int
  • Example 2:
    • Message
    • Word White Word White Word White ... White URL
    • Word White Quote Filepath Quote White Word
      White ...
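The ambiguity is easy to reproduce. The sketch below (Python; a hypothetical token set, not the actual PADS base tokens) enumerates every way to tokenize 127.0.0.1 under overlapping regular-expression definitions:

```python
import re

# Hypothetical token definitions -- the real PADS base-token set is richer.
TOKENS = [
    ("IP",    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    ("Float", r"\d+\.\d+"),
    ("Int",   r"\d+"),
    ("Dot",   r"\."),
]

def tokenizations(s, pos=0):
    """Yield every token sequence that parses s completely."""
    if pos == len(s):
        yield []
        return
    for name, pat in TOKENS:
        m = re.match(pat, s[pos:])
        if m and m.end() > 0:
            for rest in tokenizations(s, pos + m.end()):
                yield [name] + rest

seqs = list(tokenizations("127.0.0.1"))
print(len(seqs))
```

All three readings from the slide (IP; Float Dot Float; Int Dot Int Dot Int Dot Int) appear among the enumerated candidates, together with further mixed ones.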

6/33
8
How Does learnPADS Deal with TAP?
  • Tokenization phase: take the first, longest match.
  • A fixed token priority order (e.g. Float, Int,
    ID, Path) is assigned by the end user.
  • We have no such order to pick automatically.

As a result, the current learning system can't
have ambiguous base tokens (Message, Text,
ID) and sometimes produces descriptions that are too
precise.
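A minimal sketch of that rule (Python; hypothetical tokens and priority order, not the actual learnPADS lexer): the longest match wins, and the fixed priority order breaks ties.

```python
import re

# Hypothetical priority-ordered token definitions (earlier = higher priority).
TOKENS = [
    ("IP",    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    ("Float", r"\d+\.\d+"),
    ("Int",   r"\d+"),
    ("Dot",   r"\."),
]

def lex(s):
    """Tokenize by taking, at each position, the first, longest match."""
    out, pos = [], 0
    while pos < len(s):
        best = None
        for name, pat in TOKENS:
            m = re.match(pat, s[pos:])
            # Strictly longer matches win; on ties the earlier token keeps priority.
            if m and m.end() > 0 and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        out.append(best[0])
        pos += best[1]
    return out

print(lex("127.0.0.1"))  # ['IP']
print(lex("0.1"))        # ['Float']
```

On 0.1 the lexer commits to Float, illustrating how a fixed order forecloses the other readings.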
7/33
9
A Concrete Example: Tokenization by Lex

Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed, (ipc/send) invalid destination port

date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white string⟨crashreporterd⟩ char⟨[⟩ int⟨120⟩ char⟨]⟩ char⟨:⟩ white string⟨mach_msg⟩ char⟨(⟩ char⟨)⟩ white string⟨reply⟩ white string⟨failed⟩ char⟨,⟩ white char⟨(⟩ string⟨ipc⟩ char⟨/⟩ string⟨send⟩ char⟨)⟩ white string⟨invalid⟩ white string⟨destination⟩ white string⟨port⟩
8/33
10
Inspiration
  • Humans distinguish tokens using background
    knowledge:
  • Purl usually starts with http://
  • Pdate: "March" with a time like 12:30:55 nearby
  • Ptext and Pmessage are long, sometimes running
    to the end of a line
  • This knowledge can be encoded in statistical
    models.
  • Statistical models are very successful in
    Natural Language Processing and Speech Recognition.

9/33
11
learnPADS Architecture Recall
[Architecture diagram: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
10/33
12
Tokenization Problem Specification
  • Inputs:
    • A set of tokens with regular-expression
      definitions
    • A collection of strings annotated with token
      sequences (a tool labels the chunks
      automatically, given a description)
    • A test string
  • Output:
    • A valid, best token sequence for the test string
  • A supervised learning problem

11/33
13
Token sequence representation Seqset
  • A valid token sequence for a given string:
    • can parse the string
    • obeys the longest-match rule.
  • An example: "0.1"
    • valid: Int Dot Int
    • invalid: Float Dot Int
  • Seqset: the directed acyclic graph (DAG)
    representing all possible token sequences of a
    chunk.
    • Vertices: characters / positions
    • Edges: tokens, running from the token's start
      position to its end position
  • A string of 26 characters can have 14,521,680
    token sequences.
[Seqset DAG for "0.1": vertices S, "0", ".", "1"; edges Int over "0", Dot over ".", Int over "1", and Float spanning "0.1"]
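A seqset can be sketched as follows (Python; illustrative tokens): vertices are string positions, edges are token matches, and a path count over the DAG gives the number of token sequences without enumerating them.

```python
import re

# Illustrative token set; vertices are positions 0..len(s), edges are matches.
TOKENS = [("Float", r"\d+\.\d+"), ("Int", r"\d+"), ("Dot", r"\.")]

def build_seqset(s):
    """edges[i] lists (token, j) meaning the token matches s[i:j]."""
    edges = {i: [] for i in range(len(s))}
    for i in range(len(s)):
        for name, pat in TOKENS:
            m = re.match(pat, s[i:])
            if m and m.end() > 0:
                edges[i].append((name, i + m.end()))
    return edges

def count_sequences(edges, n):
    """Count paths from position 0 to n by dynamic programming."""
    ways = [0] * (n + 1)
    ways[n] = 1
    for i in range(n - 1, -1, -1):
        ways[i] = sum(ways[j] for _, j in edges[i])
    return ways[0]

g = build_seqset("0.1")
print(count_sequences(g, 3))  # 2 paths: Int Dot Int, and Float spanning "0.1"
```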
12/33
14
2-step Tokenization Algorithm
  • Given token definitions of regular expressions,
    construct the Seqsets.
  • For each record, find the most likely token
    sequence using either:
    • a Hidden Markov Model, or
    • a Hierarchical Maximum-Entropy Model

13/33
15
Hidden Markov Model (HMM)
  • Observation: ci
    • character vs. character feature vector
    • character features: upper/lower case, digit,
      punctuation...
  • Hidden state: si, a token (or partial token)

[HMM trellis: hidden states Quote Word Word Word Comma Int Int Quote emitting the characters " f o o , 1 6 "]
14/33
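A character-level Viterbi decode in this spirit (Python; toy states and hand-set probabilities standing in for the learned parameters and character feature vectors):

```python
import math

STATES = ["Word", "Int", "Comma"]
INIT = {s: 1 / 3 for s in STATES}
TRANS = {p: {s: 1 / 3 for s in STATES} for p in STATES}  # uniform, for simplicity

def emit(state, ch):
    """Hand-set emissions from simple character features (toy values)."""
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    if state == "Comma":
        return 0.9 if ch == "," else 0.05
    return 0.8 if ch.isalpha() else 0.1  # Word

def viterbi(obs):
    """Most likely hidden-state sequence for the observed characters."""
    layer = {s: (math.log(INIT[s] * emit(s, obs[0])), [s]) for s in STATES}
    for ch in obs[1:]:
        nxt = {}
        for s in STATES:
            lp, path = max(
                (layer[p][0] + math.log(TRANS[p][s]), layer[p][1])
                for p in STATES
            )
            nxt[s] = (lp + math.log(emit(s, ch)), path + [s])
        layer = nxt
    return max(layer.values())[1]

print(viterbi("foo,16"))  # ['Word', 'Word', 'Word', 'Comma', 'Int', 'Int']
```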
16
Hidden Markov Model Parameters
15/33
17
Test Data Sources
16/33
18
HMM Discussion
Error rate: percentage of tokens not identified,
w.r.t. the labeled token sequences
lex: fixed token priority; take the first,
longest match
17/33
19
More HMM Tokenization Results
18/33
20
Hierarchical Maximum-Entropy Model
[Hierarchical Max-Ent diagram: top-level tokens Quote Word Comma Int Quote over the characters " foo , 16 ", each token generating its span of characters]
19/33
21
Hierarchical Max-Ent Model Discussion
Average log probability = log likelihood /
length of the token sequence

s = "qxi qxi@cs.princeton.edu 1.63"
P(ID | qxi) = 0.8    P(White | " ") = 1.0
P(Email | qxi@cs.princeton.edu) = 0.9    P(Float | 1.63) = 0.9
P(White | Others) = 0.7    P(Others | White) = 0.8

normal:
P(ID White Email White Float | qxi qxi@cs.princeton.edu 1.63)
  = 0.8 × 1.0 × 0.9 × 1.0 × 0.9 × 0.7 × 0.8 × 0.7 × 0.8 ≈ 0.203
P(Blob | qxi qxi@cs.princeton.edu 1.63) = 0.3

average:
log P(ID White Email White Float | qxi qxi@cs.princeton.edu 1.63)
  = [(-0.097) + (-0.046) + (-0.046) + (-0.155) + (-0.097)
     + (-0.155) + (-0.097)] / 5 ≈ -0.139
log P(Blob | qxi qxi@cs.princeton.edu 1.63) = -0.523
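The slide's arithmetic can be replayed directly (base-10 logs; the probabilities are the ones given above):

```python
import math

# Emission/transition probabilities exactly as given on the slide.
factors = [0.8, 1.0, 0.9, 1.0, 0.9, 0.7, 0.8, 0.7, 0.8]
seq_len = 5  # ID White Email White Float

normal = math.prod(factors)                              # ~0.203
avg_log = sum(math.log10(f) for f in factors) / seq_len  # ~-0.138

blob_normal = 0.3
blob_avg_log = math.log10(0.3)                           # ~-0.523

# Plain probabilities favor the one-token Blob parse;
# per-token average log probabilities favor the structured parse.
print(normal < blob_normal, avg_log > blob_avg_log)  # True True
```

(The slide reports -0.139 because it averages the already-rounded per-factor logs.)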
20/33
22
Hierarchical Max-Ent Model Discussion
Error rate: percentage of tokens not identified,
w.r.t. the labeled token sequences
  lex
  normal emission probabilities
  average log emission probabilities
21/33
23
Hierarchical Max-Ent Model Results
22/33
24
Lex vs. HMM vs. Hierarchical Max-Ent
  • Ambiguity is increased.
  • The training corpus is not large enough.

23/33
25
learnPADS Architecture Recall
[Architecture diagram: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
24/33
26
Structure Discovery Phase
Example records: "0,24"  "bar,end"  "foo,16"
Candidate structures: Struct, Union, Array
Token sequences:
  Quote Int Comma Int Quote
  Quote String Comma String Quote
  Quote String Comma Int Quote
Discovered description:
  Struct { Quote; Union { Int | String }; Comma; Union { Int | String }; Quote }
Struct: classify chunks by token counts {(Quote, 2), (Comma, 1)}.
25/33
27
Extended Viterbi Algorithm
[Trellis diagram: DP states are (position, token counts) pairs, from (pos 0, {(Quote, 0), (Comma, 0)}) to (pos n, {(Quote, 2), (Comma, 1)}); edges are tokens (Msg, Txt, Int, Float, Quote, ...) weighted by probabilities P_(i,j)_Token, and only paths reaching the required final counts survive.]
26/33
28
Evaluation 1 Qualitative Judge by Human
27/33
29
Evaluation 2 Complexity Scores
  • Minimum Description Length (MDL) Principle
  • Cost (in bits) of transmitting the data =
    • cost (in bits) to transmit the description (CT)
    • + cost (in bits) to transmit the data given
      the description (CD)

28/33
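A toy comparison (all bit costs hypothetical) showing how MDL trades description cost against data cost:

```python
# Hypothetical bit costs for two candidate descriptions of the same data:
# MDL minimizes CT + CD.
candidates = {
    # catch-all Blob description: tiny CT, but the data stays expensive
    "Blob":   {"CT": 8,   "CD": 2000},
    # precise per-record struct: larger CT, much smaller CD
    "Struct": {"CT": 400, "CD": 300},
}

def mdl(c):
    return c["CT"] + c["CD"]

best = min(candidates, key=lambda name: mdl(candidates[name]))
print(best)  # Struct
```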
30
Evaluation 3 Execution Time
29/33
31
Evaluation 4 Success Rates
30/33
32
Related Work
  • Grammar induction: structure discovery without
    the token ambiguity problem
    • Arasu & Garcia-Molina '03: extracting structure
      from web pages
    • Garofalakis et al. '00: XTRACT for inferring
      DTDs
    • Kushmerick et al. '97: wrapper induction
  • Detecting table row components by Hidden Markov
    Models / Conditional Random Fields
    • Pinto et al. '03
  • Extracting certain fields in records from text
    • Borkar et al. '01
  • Predicting exons and introns in DNA sequences
    using generalized HMMs
    • Kulp '96
  • Part-of-speech tagging in natural language
    processing
    • Heeman '99 (Decision Tree)
  • Speech recognition
    • Rabiner '89

31/33
33
Future Work
  • Statistical model accuracy
    • HMM parameter re-estimation by the Baum-Welch
      algorithm
    • Hierarchical Max-Ent Model: a token-generating
      model P(S | T)
  • How to make use of vertical information
    • one record is not independent of the others
    • not suitable for large data sets
    • key alignment
    • Conditional Random Fields
  • Online learning
    • old description + old data + new data →
      new description

32/33
34
Contributions
  • Resolve the Token Ambiguity Problem with
    statistical approaches
    • use all possible token sequences.
  • Integrate two statistical approaches into the
    learnPADS framework:
    • Hidden Markov Model
    • Hierarchical Maximum-Entropy Model
  • Improve chunk partitioning in structure discovery
    with the help of Seqsets.
  • Evaluate correctness and performance by a number
    of measures.
  • Results show that multiple token sequences
    and statistical methods achieve partial success.

33/33
35
End
36
Extended Viterbi Algorithm with Token Counts
Input: a hidden Markov model H, a string
C = c1...cn, token counts tc = {(s1, f1), ..., (sl, fl)}
Output: the best token sequence that satisfies tc.
Define [recurrence omitted in the transcript],
where si occurs oi times up to position k.
1/9
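A sketch of the algorithm (Python; the toy character-level model from the HMM example, with hand-set probabilities): the DP state is (token, counts-so-far), transitions that would exceed a required count are pruned, and only paths ending with exactly the required counts are considered.

```python
import math

STATES = ["Word", "Int", "Comma"]
INIT = {s: 1 / 3 for s in STATES}
TRANS = {p: {s: 1 / 3 for s in STATES} for p in STATES}

def emit(state, ch):
    """Toy emissions from simple character features."""
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    if state == "Comma":
        return 0.9 if ch == "," else 0.05
    return 0.8 if ch.isalpha() else 0.1  # Word

def extended_viterbi(obs, tc):
    """Best state sequence whose counts of the tracked tokens equal tc."""
    tracked = sorted(tc)

    def bump(counts, s):
        # Advance the count vector for state s; None if it would exceed tc.
        if s not in tc:
            return counts
        i = tracked.index(s)
        c = list(counts)
        c[i] += 1
        return None if c[i] > tc[s] else tuple(c)

    layer = {}
    for s in STATES:
        c = bump(tuple(0 for _ in tracked), s)
        if c is not None:
            layer[(s, c)] = (math.log(INIT[s] * emit(s, obs[0])), [s])
    for ch in obs[1:]:
        nxt = {}
        for (p, counts), (lp, path) in layer.items():
            for s in STATES:
                c = bump(counts, s)
                if c is None:
                    continue  # pruned: a required count would be exceeded
                score = lp + math.log(TRANS[p][s] * emit(s, ch))
                if (s, c) not in nxt or score > nxt[(s, c)][0]:
                    nxt[(s, c)] = (score, path + [s])
        layer = nxt
    goal = tuple(tc[t] for t in tracked)
    finals = [v for (s, c), v in layer.items() if c == goal]
    return max(finals)[1] if finals else None

print(extended_viterbi("a,1", {"Comma": 1}))  # ['Word', 'Comma', 'Int']
print(extended_viterbi("a,1", {"Comma": 0}))  # ['Word', 'Word', 'Int']
```

Note how the count constraint changes the decoded sequence for the same input string.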
37
Proof
2/9
38
Reduce Execution Time
  • Parallel computing
    • seqset construction
    • most-likely-token-sequence search
    • embarrassingly parallel
  • Learn the description from a portion of the data
    • How much data is needed to learn a good
      description?

3/9
39
Better Training Set ?
  lex
  HMM results, using 19/20 data sources as training data
  HMM results, using 19/20 data sources plus 5% of the
  test data source as training data
4/9
40
Find Common Initial Tokens
union: classify chunks into different branches by
the first token of each chunk

Given S = ∪_{i=1..n} {chunk_i}, T = ∪_{i=1..k} {token_i}, and
init(chunk_i) = {token_i1, ..., token_iz} ⊆ T, find the smallest
subset I of T s.t. for all i = 1 to n, init(chunk_i) ∩ I ≠ ∅.

Set Cover Problem: Given S(token_i) = {chunk_j |
token_i ∈ init(chunk_j)}, select a minimum number
of these sets so that the sets you have picked
contain all the elements that are contained in
any of the sets.

NP-complete → approximation: greedy algorithm
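The greedy approximation can be sketched as follows (Python; hypothetical chunk data): repeatedly pick the token that covers the most still-uncovered chunks.

```python
def greedy_cover(chunk_inits):
    """chunk_inits[i] is the set of candidate initial tokens of chunk i.
    Returns a small token set hitting every chunk's init set
    (assumes every init set is non-empty)."""
    uncovered = set(range(len(chunk_inits)))
    universe = sorted({t for s in chunk_inits for t in s})
    chosen = []
    while uncovered:
        # Pick the token covering the most still-uncovered chunks
        # (ties broken alphabetically via the sorted universe).
        token = max(universe,
                    key=lambda t: sum(1 for i in uncovered if t in chunk_inits[i]))
        chosen.append(token)
        uncovered -= {i for i in uncovered if token in chunk_inits[i]}
    return chosen

print(greedy_cover([{"Int", "Quote"}, {"Quote"}, {"Word"}]))  # ['Quote', 'Word']
```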
5/9
41
Evaluation 1 Qualitative Judge by Human
6/9
42
Evaluation 2 Complexity Score
7/9
43
A Concrete Example: Tokenization by HMM

Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed, (ipc/send) invalid destination port

Lex:
date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white string⟨crashreporterd⟩ char⟨[⟩ int⟨120⟩ char⟨]⟩ char⟨:⟩ white string⟨mach_msg⟩ char⟨(⟩ char⟨)⟩ white string⟨reply⟩ white string⟨failed⟩ char⟨,⟩ white char⟨(⟩ string⟨ipc⟩ char⟨/⟩ string⟨send⟩ char⟨)⟩ white string⟨invalid⟩ white string⟨destination⟩ white string⟨port⟩

HMM:
word⟨Sat⟩ white word⟨Jun⟩ white int⟨24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white word⟨crashreporterd⟩ punctuation⟨[⟩ int⟨120⟩ punctuation⟨]⟩ punctuation⟨:⟩ message⟨mach_msg() reply failed⟩ punctuation⟨,⟩ message⟨(ipc/send) invalid destination port⟩
8/9
44
A Concrete Example: Tokenization by Max-Ent

Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed, (ipc/send) invalid destination port

Lex:
date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white string⟨crashreporterd⟩ char⟨[⟩ int⟨120⟩ char⟨]⟩ char⟨:⟩ white string⟨mach_msg⟩ char⟨(⟩ char⟨)⟩ white string⟨reply⟩ white string⟨failed⟩ char⟨,⟩ white char⟨(⟩ string⟨ipc⟩ char⟨/⟩ string⟨send⟩ char⟨)⟩ white string⟨invalid⟩ white string⟨destination⟩ white string⟨port⟩

HMM:
word⟨Sat⟩ white word⟨Jun⟩ white int⟨24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white word⟨crashreporterd⟩ punctuation⟨[⟩ int⟨120⟩ punctuation⟨]⟩ punctuation⟨:⟩ message⟨mach_msg() reply failed⟩ punctuation⟨,⟩ message⟨(ipc/send) invalid destination port⟩

Max-Ent:
date⟨Sat Jun 24⟩ white time⟨06:38:46⟩ white int⟨2006⟩ white word⟨crashreporterd⟩ punctuation⟨[⟩ int⟨120⟩ punctuation⟨]⟩ punctuation⟨:⟩ message⟨mach_msg() reply failed⟩ punctuation⟨,⟩ message⟨(ipc/send) invalid destination port⟩
9/9