Title: Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace
1. Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace
- Hiroki Arimura
- Department of Informatics, Kyushu University, Japan

Joint work with Tatsuya Asai, Shinji Kawasoe, Kenji Abe, Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ.), and Shinichi Shimozono (Kyutech).
Supported by the Grant-in-Aid for Scientific Research on Priority Areas "Discovery Science" and "Informatics", and by Japan Science and Technology Corporation (JST), PRESTO.
2. Outline
- Efficient Text Data Mining
  - Fast and robust text mining algorithms (ALT'98, ISAAC'98, DS'98)
  - Efficient text index for data mining (CPM'01, CPM'02)
  - Text mining on external storage (PAKDD'00)
- Applications
  - Interactive document browsing
  - Keyword discovery from the Web
- Towards Semi-structured Data Mining
  - Efficient frequent tree miner (SDM'02, PKDD'02)
  - Mining semi-structured data streams (ICDM'02)
  - Information extraction from the Web (GI'00, FLAIRS'01)
- Conclusion
3. Efficient Text Data Mining with Optimized Pattern Discovery
Joint work with Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ.), and Shinichi Shimozono (Kyutech).
4. Large Text Databases
- Have emerged in the early 90s with the rapid progress of computer and network technologies:
  - Web pages (OpenText Index; GBs to TBs)
  - Collections of XML / SGML documents
  - Genome databases (GenBank, PIR)
  - Bibliographic databases (INSPEC, ResearchIndex)
  - Emails and plain texts on a file system
- Huge, heterogeneous, unstructured data
- Traditional database / data mining technology does not work!
5. Our Research Goal
- Text mining as an inverse of IR
- Develop a new access tool for text data that interactively supports human discovery from large text collections
- Key: fast and robust text mining methods
6. Browsing a large collection of documents with unknown vocabulary and structure
Discovered phrase patterns: <dallers>, <gulf>, <wheat>, <shipping>, <u.s.>, <sea men>, <strike>, <port>, <ships>, <the gulf>, <vessels>, <iranian>, <attack>, <silk worm missile>, <iran>
Reuters-21578: 21,578 articles from Reuters newswires, Feb. to Oct. 1987, on economy and international affairs.
7. Our Research Goal
- Text mining as an inverse of IR
- Develop a new access tool for text data that interactively supports human discovery from large text collections
- Key: fast and robust text mining methods
8. Proximity Word Association Patterns
- Association rules over arbitrary subwords
- Ordered: an ordering is imposed among the subwords
- Proximity: consecutive subwords must occur within distance k (the proximity)

[Figure: the subwords CACA, AGGAT, CCAA aligned against texts such as GTGTCACATGTTTCTGTAGAAAGAGGTCCA and GTGTCACAAATTCTGTAGTATCA.]

Parameters: the maximum number d of substrings, and the proximity k.
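The matching semantics above can be sketched directly; `matches` is an illustrative helper (not from the talk) that checks whether the subwords occur in order with each gap at most k characters, backtracking over the choice of occurrences:

```python
def matches(text, words, k):
    """Check whether `words` occur in `text` in the given order, with the
    gap between consecutive occurrences at most k characters."""
    def search(i, start, bound):
        # try every admissible start position of words[i], backtracking
        pos = text.find(words[i], start)
        while pos != -1 and (bound is None or pos <= bound):
            if i == len(words) - 1:
                return True
            end = pos + len(words[i])
            if search(i + 1, end, end + k):
                return True
            pos = text.find(words[i], pos + 1)
        return False
    return search(0, 0, None)
```

For example, `matches("abcXXdef", ["abc", "def"], 2)` holds (gap of 2), but with proximity k = 1 it does not.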
9. Related Research
- Feldman and Dagan (KDD'96)
  - Association rules over keywords: Arab, Egypt, Iran => Oil
  - Using an Apriori-style algorithm of Agrawal et al. (1994)
- Motwani (SIGMOD'97)
  - Correlations over keywords
- Mannila and Toivonen (KDD'96)
  - Episode patterns (partially ordered sets of events)
- Wang, Chirn, Marr, Shapiro, Shasha, Zhang (SIGMOD'94)
  - Word association patterns without proximity, e.g. AGAGTATA ... AGAT
  - A generate-and-test algorithm with heuristics
  - Implementation for d = 2, or d = 1 with approximate matching
- Iliopoulos, Makris, Sioutas, Tsakalidis, Tsichlas (CPM'02, this conference!)
  - Model identification problem for maximal pairs of strings (2-dim)
  - Common or frequent pattern discovery for d = 2 with proximity
10. Goal: find those patterns that characterize the target collection
What properties separate the target data from the rest of the data?
11. Optimized Rule/Pattern Discovery
- Data mining: optimized data mining (IBM DM group, 1996-2000)
- Learning theory: agnostic PAC learning (1990s)
- Statistics: Vapnik-Chervonenkis theory (1970s)

Impurity function ψ(p), where p is the ratio of positives that a pattern matches. Minimization of:
- prediction error
- information entropy
- Gini index
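The three impurity functions listed above can be written down directly; each is concave on [0, 1], zero on a pure split (p = 0 or p = 1), and maximal at p = 1/2, which is what makes them usable as split criteria. A sketch with illustrative function names:

```python
import math

def pred_error(p):
    # prediction error of the majority label on a split with positive ratio p
    return min(p, 1 - p)

def entropy(p):
    # information entropy of a {p, 1-p} split
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    # Gini index
    return 2 * p * (1 - p)
```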
12. Optimized Rule/Pattern Discovery
Goodness of a pattern = goodness of the split induced by the pattern = the weighted average of the impurity function over the matched and unmatched sets.
13. Optimized Rule/Pattern Discovery
Evaluation function for a pattern π:
G_{S,ψ}(π) = (N1/N)·ψ(M1/N1) + (N0/N)·ψ(M0/N0)
where N1 (resp. N0) is the number of documents that match (resp. do not match) π, M1 and M0 are the positives among them, and N = N1 + N0.
14. Optimal Pattern Discovery Problem
- Given: a set S of documents with an objective labeling ξ : S → {0, 1}.
- Problem: find an optimal pattern π ∈ P that minimizes the evaluation function
  G_{S,ψ}(π) = (N1/N)·ψ(M1/N1) + (N0/N)·ψ(M0/N0)
15. Relation to Robust Probabilistic Learning
- Statistical decision theory in the 70s
  - Vapnik-Chervonenkis theory (1970s)
- Computational learning theory in the 90s
  - Agnostic PAC learning / robust trainability (Kearns et al. '92)
  - An algorithm that efficiently solves the classification-error minimization problem is an efficient robust learner; that is, it can approximate an arbitrary distribution generating the examples, from the viewpoint of classification (Haussler 1990).
  - Intractable in general (Kearns et al. 1992)
- Empirical machine learning in the 90s
  - The power of simple rules + rigorous optimization (Weiss; Holte)
- Data mining + COLT in the mid 90s
  - Efficient algorithms for simple geometric patterns
16. Application to Text Mining
- Traditional method (frequency-based)
  - Finds the most frequent patterns in the target set T.
  - Many trivial patterns (stop words) may hide less frequent but interesting patterns.
  - Traditional stop-word elimination in IR may not work.

[Figure: word frequency over the vocabulary of the target dataset; stop words such as "the, a, an, that, of, with" dominate, while topical phrases such as "Iranian oil platform", "quwaiti tanker", "Silkworm missile", "iranian oil", "quwaiti tanker attack" are far less frequent.]
17. Application to Text Mining
- Optimized data mining
  - Finds optimal patterns.
  - Uses an average dataset B of documents as a control dataset for canceling trivial patterns.
  - Finds those patterns that appear more frequently in the target set T and less frequently in the control set B.

[Figure: the same frequency plot over both the target and the background dataset; the stop words are frequent in both collections, so they no longer stand out.]
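The contrast between the two rankings can be illustrated with a toy scorer. Here `by_contrast` uses a simplified document-frequency difference rather than the impurity-based G of the talk, and all names and data are illustrative:

```python
from collections import Counter

def by_frequency(target):
    """Rank words by raw frequency in the target collection."""
    counts = Counter(w for doc in target for w in doc.split())
    return [w for w, _ in counts.most_common()]

def by_contrast(target, background):
    """Rank words by how strongly they separate target from background:
    document frequency in the background minus in the target, ascending,
    so the most target-specific words come first."""
    def df(docs, w):
        return sum(w in d.split() for d in docs) / len(docs)
    words = {w for d in target for w in d.split()}
    return sorted(words, key=lambda w: df(background, w) - df(target, w))
```

On a toy target of ship-related snippets, raw frequency puts "the" first, while the contrastive ranking pushes it to the bottom.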
18. Proximity Word Association Patterns
- Association rules over arbitrary subwords
- Ordered: an ordering is imposed among the subwords
- Proximity: consecutive subwords must occur within distance k (the proximity)

[Figure: the subwords CACA, AGGAT, CCAA aligned against texts such as GTGTCACATGTTTCTGTAGAAAGAGGTCCA and GTGTCACAAATTCTGTAGTATCA.]

Parameters: the maximum number d of substrings, and the proximity k.
19. Straightforward algorithm: the case where the number d of substrings is bounded
- Procedure:
  - Enumerate all O(n^{2d}) proximity patterns built from the O(n^2) subwords of the text.
  - For each pattern p, compute its score in linear time.
- The straightforward algorithm requires O(n^{2d+1}) time and is too slow to apply to real-world databases.
- We require more efficient algorithms that run in, say, O(n) to O(n log n) time on real datasets.
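The straightforward procedure can be sketched as an exhaustive search. This toy version (illustrative names; a regular expression stands in for proximity matching) makes the O(n^{2d}) candidate space explicit:

```python
import re
from itertools import product

def best_pattern(pos, neg, d, k, impurity):
    """Exhaustively try every d-tuple of substrings of the data as a
    k-proximity pattern and keep the tuple minimizing the weighted
    impurity of the split it induces on pos + neg."""
    docs = pos + neg
    subs = {t[i:j] for t in docs
            for i in range(len(t)) for j in range(i + 1, len(t) + 1)}
    def occurs(t, ws):
        # consecutive subwords separated by at most k characters
        rx = (r'.{0,%d}' % k).join(re.escape(w) for w in ws)
        return re.search(rx, t) is not None
    best_g, best_ws = float('inf'), None
    for ws in product(sorted(subs), repeat=d):
        N1 = sum(occurs(t, ws) for t in docs)
        M1 = sum(occurs(t, ws) for t in pos)
        N0, M0 = len(docs) - N1, len(pos) - M1
        g = 0.0
        if N1: g += (N1 / len(docs)) * impurity(M1 / N1)
        if N0: g += (N0 / len(docs)) * impurity(M0 / N0)
        if g < best_g:
            best_g, best_ws = g, ws
    return best_g, best_ws
```

Even on tiny inputs the candidate space grows as the 2d-th power of the text length, which is exactly why the talk moves on to suffix-array-based methods.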
20. Theoretical result (positive): the number d of substrings is bounded
- Theorem: For a set of random texts of total size N, the Split-Merge algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(k^d (log N)^{d+1} N) and space O(max(k, d) N).
- The constant factor is large in practice: d = 2 to 4, k = 2 to 8 (words), and log N = 10 to 20 (Reuters-21578 collection, 15.6MB).

Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
21. Theoretical result (negative): the number d of substrings is unbounded
- Theorem: If the number d of subwords is unbounded, then there is no polynomial-time approximation algorithm that solves the optimal pattern problem above within an arbitrarily small approximation ratio, assuming P ≠ NP (the problem is MAX SNP-hard).

Details of the algorithm: Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
22. Suffix tree / suffix array
Data structures for efficiently storing all O(n^2) substrings of a text in O(n) space.
[Figure: the suffix tree of a small example text, with leaves labeled by the positions 1-9.]
- Suffix tree (McCreight, 1976)
  - Represents all the substrings in O(n) space.
- Problems:
  - Not space efficient.
  - Dynamic reconstruction is not easy.
  - Not suitable for implementation on secondary storage.
23. Basic Idea
- Reduce finding the best k-proximity d-word association pattern
- to finding the best d-dimensional box over the rank space (translation by the suffix array).
- The position space consists of all possible pairs of positions that differ by at most k.
- The rank space consists of all pairs of suffixes of the text, ordered lexicographically.
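The position-to-rank translation rests on the suffix array. A naive construction (comparison sort of the suffixes, quadratic in the worst case but enough for illustration) looks like this:

```python
def suffix_array(text):
    """sa[r] = start position of the r-th lexicographically smallest
    suffix (naive construction for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def rank_array(sa):
    """rank[i] = lexicographic rank of the suffix starting at position i,
    i.e. the inverse permutation of sa: this is the rank-space coordinate."""
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank
```

For example, `suffix_array("banana")` is `[5, 3, 1, 0, 4, 2]`.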
24. An O(k^d (log N)^{d+1} N)-time Algorithm
- Improvement of a generate-and-test algorithm
- Uses a d-dimensional orthogonal range tree structure
[Figure: the two-dimensional case.]
25. An O(N^2 log^2 N)-time Algorithm in the two-dimensional case (slow)
- Improvement of a generate-and-test algorithm
- Direct use of the d-dimensional orthogonal range tree structure
- O(N log^2 N) space and preprocessing, O(log^2 N) time per query
- Only a slight improvement: O(N^2 log^2 N) < O(N^3)
- Impractical: too slow and too space consuming
[Figure: a range tree of mean height O(log N).]
26. Problems in Implementation
- Implementation of the Split-Merge algorithm
  - Maintains the suffix tree in memory.
  - The algorithm is too complex and requires a huge amount of memory.
  - Applicable only to small datasets (about 50KB) on a workstation with 256MB of memory (Proc. 9th Algorithmic Learning Theory, 1998).
- Difficulty in applying it to large text data:
  - large applications in bioinformatics, and
  - text and Web mining.
27. Implementation: From Trees to Arrays
- An efficient full-text index for text mining: replacing the tree with one-dimensional arrays.
- Most operations in the Split-Merge algorithm can be efficiently implemented with the Suffix and Height arrays.
- Enumeration of substrings and their occurrences is done in linear time by simulating the DFS of the "virtual" suffix tree while scanning the Height array.
- Reconstruction (restriction) of the suffix and height arrays can be done with O(n log n) integer sorting and O(1)-time LCA/range-minima computation (Farach-Colton, Ferragina, Muthukrishnan '00).

T. Kasai, G. Lee, H. Arimura, S. Arikawa, K.S. Park, "Linear-time longest-common-prefix computation in suffix arrays and its applications", CPM'01; H. Arimura, CPM'01 talk.
28. Implementation: a More Practical Algorithm
- Split-Merge-with-Array algorithm (SMA)
  - A re-implementation of SMT with the Suffix and Height arrays.
  - Has the same time complexity as SMT and slightly improved space complexity on average.
  - Easy to implement and scalable, thanks to a simple data structure that extensively uses one-dimensional arrays together with sorting and mapping operations over them.
- Theorem: For a set of random texts of total size N, the SMA algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(N (log N)^{d+1}) and space O(max(k, d) N).
29. Results
- We developed a linear-time algorithm for the substring traversal problem that simulates the post-order traversal of the suffix tree of a text using the suffix array and the lcp information.
- Advantages:
  - More space efficient than the suffix tree (7n bytes + stack vs. 15n bytes + stack).
  - Faster than the naive simulation using the Pos array alone with binary search.
  - Easy to handle and implement.
  - The linear-time solution of the lcp problem is essential.
- Enables the use of suffix arrays instead of suffix trees in large-scale applications in bioinformatics and data mining.
30. Algorithm
[Figure: a single left-to-right scan of the Height array (2 1 0 3 1 0 2 0 -1) drives a stack of (position, depth) pairs, e.g. (5, 3), (4, 1), (3, 0), (-1, -1), over the Suffix array (4 1 8 2 5 6 3 7 9) of a small example text.]
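The scan sketched in the figure can be written out as follows: a stack of (string depth, leaf count) pairs simulates the post-order traversal of the virtual suffix tree, emitting each internal node exactly once (a sketch; the function name and the (depth, count) output format are illustrative):

```python
def internal_nodes(sa, height):
    """One scan of the height array with a stack, simulating the
    post-order traversal of the virtual suffix tree; emits
    (string depth, number of leaves) for every internal node."""
    out = []
    stack = [(-1, 0)]                          # sentinel (depth, leaves)
    for r in range(1, len(sa) + 1):
        h = height[r] if r < len(sa) else -1   # h = -1 flushes the stack
        leaves = 1                             # the leaf for suffix sa[r-1]
        while stack[-1][0] > h:
            d, c = stack.pop()
            c += leaves
            out.append((d, c))                 # node finished: post-order emit
            leaves = c
        if stack[-1][0] < h:
            stack.append((h, leaves))
        else:
            stack[-1] = (stack[-1][0], stack[-1][1] + leaves)
    return out
```

For "banana" (sa = [5, 3, 1, 0, 4, 2], height = [0, 1, 3, 0, 0, 2]) this reports the nodes for "ana" (depth 3, 2 leaves), "a" (1, 3), "na" (2, 2), and the root (0, 6), so substring frequencies fall out of the scan for free.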
31. Prototype system
- Based on computational geometry techniques.
- Built on a full-text index, the suffix array.
- Virtual traversal technique over the suffix array.
- Space requirement reduced to O(dn) with small constants by the extensive use of the suffix array and ternary quicksort.
- g++ on Solaris 2.6, Sun Ultra SPARC IIi, 250MHz.
32. Running time
Summary for various values of the parameters d and k:
- Data: 15.2MB (SHIP data from the Reuters-21578 dataset)
- Sun Microsystems Ultra SPARC II, 300MHz, 512MB, g++ on Solaris 2.6
- Best 200 patterns with entropy minimization
33. Experiments on Document Browsing
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
34. Application to document browsing
Target category: ship; control: all categories but ship (pos and neg: 2970 / 7887).

Frequency-based mining (rank, pattern):
1 <reuter>  2 <the>  3 <to>  4 <said>  5 <of>  6 <and>  7 <in>  8 <a>  9 <s>  10 <on>  11 <for>  12 <at>  13 <by>  14 <said the>  15 <in the>  16 <with>  17 <from>  18 <of the>  19 <was>  20 <but>

Optimization-based mining (rank, pattern):
1 <shipping>  2 <ships>  3 <gulf>  4 <vessels>  5 <the gulf>  6 <port>  7 <ship>  8 <quwaiti>  9 <iranian>  10 <iran>  11 <in the gulf>  12 <tankers>  13 <cargo>  14 <vessel>  15 <warships>  16 <strike>  17 <attack>  18 <tanker>  19 <flag>  20 <ports>
35. Application to document browsing
Finding the phrases that characterize the articles of the category ship against the articles of all other categories.
36. Optimization-based vs. frequency-based data mining
- Measured: the number of stop words and the number of title words among the phrases found.
- Title words: taken from the <TITLE> section of the Reuters newswires.
- Stop words: taken from the standard stop-word lists for the Brown corpus.
37. Discovery of Important Keywords in the Cyberspace
- Target dataset: the base set for the query "HONDA"
  - Root set S: the best 200 hits returned by AltaVista(TM).
  - Base set T (1,000-5,000 pages): all pages at link distance one from the pages in S
    - forward links: pages linked from S;
    - back links: up to 50 randomly selected pages pointing into S.
- Control dataset: the base set for the query "Softbank"
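The base-set construction above follows the root/base-set expansion familiar from Kleinberg-style link analysis; a sketch, where the function names and the dict-of-lists link maps are illustrative assumptions:

```python
import random

def base_set(root, out_links, in_links, sample=50):
    """Expand the root set S (top query hits) to the base set T: add every
    page S points to, plus up to `sample` randomly chosen pages pointing
    into each page of S."""
    T = set(root)
    for page in root:
        T.update(out_links.get(page, ()))        # forward links from S
        preds = in_links.get(page, [])
        T.update(random.sample(preds, min(sample, len(preds))))  # back links
    return T
```

The random sampling of back links bounds the growth of T, since popular pages can have thousands of in-links.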
38. Frequency-based vs. Optimization-based
Mining patterns in the target/positive dataset (HONDA) using the background/negative dataset (SOFTBANK).
- HONDA and SOFTBANK: an automobile company vs. an internet business.
39. Dependence on the Background/Negative Data
Mining patterns in the target/positive dataset (HONDA) while varying the background/negative dataset (SOFTBANK or TOYOTA).
- HONDA vs. SOFTBANK: an automobile company vs. an internet business.
- HONDA vs. TOYOTA: both are automobile companies.
40. Conclusion
- Text databases
- Optimized pattern discovery
- Proximity phrase association patterns
- Fast and robust text mining algorithms
  - Split-Merge algorithm for finding the optimal patterns
  - Levelwise-Scan algorithm for large disk-resident data
- Applications
  - Interactive document browsing
  - Web mining

Please visit http://www.i.kyushu-u.ac.jp/arim/
41. Semi-structured Data

<ARTICLE status="draft">
  <TITLE> Fast Text Data Mining with optimal pattern discovery </TITLE>
  <AUTHOR> H. Arimura </AUTHOR>
  <AUTHOR> T. KASAI </AUTHOR>
  <AUTHOR> A. WATAKI </AUTHOR>
  <AUTHOR> S. Arikawa </AUTHOR>
  <ABSTRACT> This paper considers the efficient discovery of a simple class of patterns from large text databases. </ABSTRACT>
  <SECTION>
    <TITLE> Introduction </TITLE>
    <BODY> Recent progress of network and storage technology enables us to collect and accumulate ... </BODY>
  </SECTION>
  <SECTION>
    <TITLE> Preliminaries </TITLE>
    <BODY> In this section, we give basic definitions and results on ... </BODY>
  </SECTION>
  ...
</ARTICLE>

Web / XML data
42. Theoretical results
- Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time O(k^2 b^k N).
  (Note: a straightforward algorithm has super-linear time complexity when the number of labels grows with N.)
- Theorem: If the maximum size k of the patterns is unbounded, then for any ε > 0 there exists no polynomial-time (770/767 - ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size over an unbounded label alphabet, unless P = NP.

Proc. SIAM Data Mining (SDM'02), 2002, and Proc. PKDD'02, 2002.
43. Mining Frequent Tree Patterns
[Figure: a tree pattern π matched against a collection of data trees T1, T2, T3, T4.]
45. Efficient Substructure Discovery from Large Semi-structured Data
- Asai, Abe, Kawasoe, Arimura, Sakamoto, Arikawa
- Proc. 2nd SIAM International Conference on Data Mining (SDM'02), Arlington, April 2002. (To appear)
- Also presented at the DE workshop (Oct.), SIG-FAI/KBS (Nov.), and DEWS'02.
46. Correctness of FREQT
- Theorem: Given a minimum support threshold 0 < σ ≤ 1, the algorithm Find-Freq-Trees finds all the tree patterns whose frequency in the data is at least σ.
- An extension to optimized pattern discovery over semi-structured data appears in PKDD'02 (Aug. 2002).
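FREQT's candidate generation is the rightmost-expansion technique: a labeled ordered tree of size n+1 is produced exactly once by attaching one new node on the rightmost path of a tree of size n, so the pattern space is searched without duplicates. A sketch over a (depth, label) preorder encoding (the encoding choice here is illustrative):

```python
def rightmost_expansions(tree, labels):
    """One step of FREQT-style candidate generation.  A labeled ordered
    tree is encoded as its preorder list of (depth, label); every larger
    tree arises exactly once by appending a new rightmost node whose depth
    is between 1 and (depth of the current rightmost leaf) + 1."""
    if not tree:
        for lab in labels:
            yield [(0, lab)]          # the single-node trees
        return
    last_depth = tree[-1][0]
    for d in range(1, last_depth + 2):
        for lab in labels:
            yield tree + [(d, lab)]
```

Starting from a single root and expanding repeatedly enumerates every labeled ordered tree without duplicates; counting each candidate's frequency against the data then prunes the search, as in Find-Freq-Trees.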