Title: Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace
1. Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace
- Hiroki Arimura
- Department of Informatics, Kyushu University, Japan

Joint work with Tatsuya Asai, Shinji Kawasoe, Kenji Abe, Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ.), and Shinichi Shimozono (Kyutech).
Supported by the Grant-in-Aid for Scientific Research on Priority Areas "Discovery Science" and "Informatics", and by Japan Science and Technology Corporation (JST), PRESTO.
2. Outline
- Efficient Text Data Mining
  - Fast and robust text mining algorithms (ALT'98, ISAAC'98, DS'98)
  - Efficient text index for data mining (CPM'01, CPM'02)
  - Text mining on external storage (PAKDD'00)
- Applications
  - Interactive document browsing
  - Keyword discovery from the Web
- Towards Semi-structured Data Mining
  - Efficient frequent tree miner (SDM'02, PKDD'02)
  - Mining semi-structured data streams (ICDM'02)
  - Information extraction from the Web (GI'00, FLAIRS'01)
- Conclusion
3. Efficient Text Data Mining with Optimized Pattern Discovery
Joint work with Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ.), and Shinichi Shimozono (Kyutech).
4. Large Text Databases
- Have emerged in the early 90s with the rapid progress of computer and network technologies:
  - Web pages (OpenText Index; GBs to TBs)
  - Collections of XML / SGML documents
  - Genome databases (GenBank, PIR)
  - Bibliographic databases (INSPEC, ResearchIndex)
  - Emails and plain texts on a file system
- Huge, heterogeneous, unstructured data
- Traditional database / data mining technology does not work!
5. Our Research Goal
- Text mining as an inverse of IR
- Develop a new access tool for text data that interactively supports human discovery from large text collections
- Key: fast and robust text mining methods
6. Browsing a large collection of documents with unknown vocabulary and structure
Discovered phrase patterns: <dallers>, <gulf>, <wheat>, <shipping>, <u.s.>, <sea men>, <strike>, <port>, <ships>, <the gulf>, <vessels>, <iranian>, <attack>, <silk worm missile>, <iran>
Reuters-21578: 21,578 articles from Reuters newswires, Feb. to Oct. 1987, on economy and international affairs.
7. Our Research Goal
- Text mining as an inverse of IR
- Develop a new access tool for text data that interactively supports human discovery from large text collections
- Key: fast and robust text mining methods
8. Proximity Word Association Patterns
- Association rules over arbitrary subwords
- Ordered: an ordering is imposed among the subwords
- Proximity: consecutive subwords must occur within distance k (the proximity)

[Figure: the subwords CACA, AGGAT, CCAA aligned against texts such as GTGTCACATGTTTCTGTAGAAAGAGGTCCA and GTGTCACAAATTCTGTAGTATCA.]

Parameters: the maximum number d of substrings, and the proximity k.
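The matching semantics above can be sketched directly; `matches` is an illustrative helper (not from the talk) that checks whether the subwords occur in order with each gap at most k characters, backtracking over the choice of occurrences:

```python
def matches(text, words, k):
    """Check whether `words` occur in `text` in the given order, with the
    gap between consecutive occurrences at most k characters."""
    def search(i, start, bound):
        # try every admissible start position of words[i], backtracking
        pos = text.find(words[i], start)
        while pos != -1 and (bound is None or pos <= bound):
            if i == len(words) - 1:
                return True
            end = pos + len(words[i])
            if search(i + 1, end, end + k):
                return True
            pos = text.find(words[i], pos + 1)
        return False
    return search(0, 0, None)
```

For example, `matches("abcXXdef", ["abc", "def"], 2)` holds (gap of 2), but with proximity k = 1 it does not.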
9. Related Research
- Feldman and Dagan (KDD'96)
  - Association rules over keywords: Arab, Egypt, Iran => Oil
  - Using an Apriori-style algorithm of Agrawal et al. (1994)
- Motwani (SIGMOD'97)
  - Correlations over keywords
- Mannila and Toivonen (KDD'96)
  - Episode patterns (partially ordered sets of events)
- Wang, Chirn, Marr, Shapiro, Shasha, Zhang (SIGMOD'94)
  - Word association patterns without proximity, e.g. AGAGTATA ... AGAT
  - A generate-and-test algorithm with heuristics
  - Implementation for d = 2, or d = 1 with approximate matching
- Iliopoulos, Makris, Sioutas, Tsakalidis, Tsichlas (CPM'02, this conference!)
  - Model identification problem for maximal pairs of strings (2-dim)
  - Common or frequent pattern discovery for d = 2 with proximity
10. Goal: find those patterns that characterize the target collection
What properties separate the target data from the rest of the data?
11. Optimized Rule/Pattern Discovery
- Data mining: optimized data mining (IBM DM group, 1996-2000)
- Learning theory: agnostic PAC learning (1990s)
- Statistics: Vapnik-Chervonenkis theory (1970s)

Impurity function ψ(p), where p is the ratio of positives that a pattern matches. Minimization of:
- prediction error
- information entropy
- Gini index
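The three impurity functions listed above can be written down directly; each is concave on [0, 1], zero on a pure split (p = 0 or p = 1), and maximal at p = 1/2, which is what makes them usable as split criteria. A sketch with illustrative function names:

```python
import math

def pred_error(p):
    # prediction error of the majority label on a split with positive ratio p
    return min(p, 1 - p)

def entropy(p):
    # information entropy of a {p, 1-p} split
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    # Gini index
    return 2 * p * (1 - p)
```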
12. Optimized Rule/Pattern Discovery
Goodness of a pattern = goodness of the split induced by the pattern = the weighted average of the impurity function over the matched and unmatched sets.
13. Optimized Rule/Pattern Discovery
Evaluation function for a pattern π:
G_{S,ψ}(π) = (N1/N)·ψ(M1/N1) + (N0/N)·ψ(M0/N0)
where N1 (resp. N0) is the number of documents that match (resp. do not match) π, M1 and M0 are the positives among them, and N = N1 + N0.
14. Optimal Pattern Discovery Problem
- Given: a set S of documents with an objective labeling ξ : S → {0, 1}.
- Problem: find an optimal pattern π ∈ P that minimizes the evaluation function
  G_{S,ψ}(π) = (N1/N)·ψ(M1/N1) + (N0/N)·ψ(M0/N0)
15. Relation to Robust Probabilistic Learning
- Statistical decision theory in the 70s
  - Vapnik-Chervonenkis theory (1970s)
- Computational learning theory in the 90s
  - Agnostic PAC learning / robust trainability (Kearns et al. '92)
  - An algorithm that efficiently solves the classification-error minimization problem is an efficient robust learner; that is, it can approximate an arbitrary distribution generating the examples, from the viewpoint of classification (Haussler 1990).
  - Intractable in general (Kearns et al. 1992)
- Empirical machine learning in the 90s
  - The power of simple rules + rigorous optimization (Weiss; Holte)
- Data mining + COLT in the mid 90s
  - Efficient algorithms for simple geometric patterns
16. Application to Text Mining
- Traditional method (frequency-based)
  - Finds the most frequent patterns in the target set T.
  - Many trivial patterns (stop words) may hide less frequent but interesting patterns.
  - Traditional stop-word elimination in IR may not work.

[Figure: word frequency over the vocabulary of the target dataset; stop words such as "the, a, an, that, of, with" dominate, while topical phrases such as "Iranian oil platform", "quwaiti tanker", "Silkworm missile", "iranian oil", "quwaiti tanker attack" are far less frequent.]
17. Application to Text Mining
- Optimized data mining
  - Finds optimal patterns.
  - Uses an average dataset B of documents as a control dataset for canceling trivial patterns.
  - Finds those patterns that appear more frequently in the target set T and less frequently in the control set B.

[Figure: the same frequency plot over both the target and the background dataset; the stop words are frequent in both collections, so they no longer stand out.]
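The contrast between the two rankings can be illustrated with a toy scorer. Here `by_contrast` uses a simplified document-frequency difference rather than the impurity-based G of the talk, and all names and data are illustrative:

```python
from collections import Counter

def by_frequency(target):
    """Rank words by raw frequency in the target collection."""
    counts = Counter(w for doc in target for w in doc.split())
    return [w for w, _ in counts.most_common()]

def by_contrast(target, background):
    """Rank words by how strongly they separate target from background:
    document frequency in the background minus in the target, ascending,
    so the most target-specific words come first."""
    def df(docs, w):
        return sum(w in d.split() for d in docs) / len(docs)
    words = {w for d in target for w in d.split()}
    return sorted(words, key=lambda w: df(background, w) - df(target, w))
```

On a toy target of ship-related snippets, raw frequency puts "the" first, while the contrastive ranking pushes it to the bottom.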
18. Proximity Word Association Patterns
- Association rules over arbitrary subwords
- Ordered: an ordering is imposed among the subwords
- Proximity: consecutive subwords must occur within distance k (the proximity)

[Figure: the subwords CACA, AGGAT, CCAA aligned against texts such as GTGTCACATGTTTCTGTAGAAAGAGGTCCA and GTGTCACAAATTCTGTAGTATCA.]

Parameters: the maximum number d of substrings, and the proximity k.
19. Straightforward algorithm: the case where the number d of substrings is bounded
- Procedure:
  - Enumerate all O(n^{2d}) proximity patterns built from the O(n^2) subwords of the text.
  - For each pattern p, compute its score in linear time.
- The straightforward algorithm requires O(n^{2d+1}) time and is too slow to apply to real-world databases.
- We require more efficient algorithms that run in, say, O(n) to O(n log n) time on real datasets.
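The straightforward procedure can be sketched as an exhaustive search. This toy version (illustrative names; a regular expression stands in for proximity matching) makes the O(n^{2d}) candidate space explicit:

```python
import re
from itertools import product

def best_pattern(pos, neg, d, k, impurity):
    """Exhaustively try every d-tuple of substrings of the data as a
    k-proximity pattern and keep the tuple minimizing the weighted
    impurity of the split it induces on pos + neg."""
    docs = pos + neg
    subs = {t[i:j] for t in docs
            for i in range(len(t)) for j in range(i + 1, len(t) + 1)}
    def occurs(t, ws):
        # consecutive subwords separated by at most k characters
        rx = (r'.{0,%d}' % k).join(re.escape(w) for w in ws)
        return re.search(rx, t) is not None
    best_g, best_ws = float('inf'), None
    for ws in product(sorted(subs), repeat=d):
        N1 = sum(occurs(t, ws) for t in docs)
        M1 = sum(occurs(t, ws) for t in pos)
        N0, M0 = len(docs) - N1, len(pos) - M1
        g = 0.0
        if N1: g += (N1 / len(docs)) * impurity(M1 / N1)
        if N0: g += (N0 / len(docs)) * impurity(M0 / N0)
        if g < best_g:
            best_g, best_ws = g, ws
    return best_g, best_ws
```

Even on tiny inputs the candidate space grows as the 2d-th power of the text length, which is exactly why the talk moves on to suffix-array-based methods.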
20. Theoretical result (positive): the number d of substrings is bounded
- Theorem: For a set of random texts of total size N, the Split-Merge algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(k^d (log N)^{d+1} N) and space O(max(k, d) N).
- The constant factor is large in practice: d = 2 to 4, k = 2 to 8 (words), and log N = 10 to 20 (Reuters-21578 collection, 15.6MB).

Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
21. Theoretical result (negative): the number d of substrings is unbounded
- Theorem: If the number d of subwords is unbounded, then there is no polynomial-time approximation algorithm that solves the optimal pattern problem above within an arbitrarily small approximation ratio, assuming P ≠ NP (the problem is MAX SNP-hard).

Details of the algorithm: Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
22. Suffix tree / suffix array
Data structures for efficiently storing all O(n^2) substrings of a text in O(n) space.
[Figure: the suffix tree of a small example text, with leaves labeled by the positions 1-9.]
- Suffix tree (McCreight, 1976)
  - Represents all the substrings in O(n) space.
- Problems:
  - Not space efficient.
  - Dynamic reconstruction is not easy.
  - Not suitable for implementation on secondary storage.
23. Basic Idea
- Reduce finding the best k-proximity d-word association pattern
- to finding the best d-dimensional box over the rank space (translation by the suffix array).
- The position space consists of all possible pairs of positions that differ by at most k.
- The rank space consists of all pairs of suffixes of the text, ordered lexicographically.
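The position-to-rank translation rests on the suffix array. A naive construction (comparison sort of the suffixes, quadratic in the worst case but enough for illustration) looks like this:

```python
def suffix_array(text):
    """sa[r] = start position of the r-th lexicographically smallest
    suffix (naive construction for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def rank_array(sa):
    """rank[i] = lexicographic rank of the suffix starting at position i,
    i.e. the inverse permutation of sa: this is the rank-space coordinate."""
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank
```

For example, `suffix_array("banana")` is `[5, 3, 1, 0, 4, 2]`.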
24. An O(k^d (log N)^{d+1} N)-time Algorithm
- Improvement of a generate-and-test algorithm
- Uses a d-dimensional orthogonal range tree structure
[Figure: the two-dimensional case.]
25. An O(N^2 log^2 N)-time Algorithm in the two-dimensional case (slow)
- Improvement of a generate-and-test algorithm
- Direct use of the d-dimensional orthogonal range tree structure
- O(N log^2 N) space and preprocessing, O(log^2 N) time per query
- Only a slight improvement: O(N^2 log^2 N) < O(N^3)
- Impractical: too slow and too space consuming
[Figure: a range tree of mean height O(log N).]
26. Problems in Implementation
- Implementation of the Split-Merge algorithm
  - Maintains the suffix tree in memory.
  - The algorithm is too complex and requires a huge amount of memory.
  - Applicable only to small datasets (about 50KB) on a workstation with 256MB of memory (Proc. 9th Algorithmic Learning Theory, 1998).
- Difficulty in applying it to large text data:
  - large applications in bioinformatics, and
  - text and Web mining.
27. Implementation: From Trees to Arrays
- An efficient full-text index for text mining: replacing the tree with one-dimensional arrays.
- Most operations in the Split-Merge algorithm can be efficiently implemented with the Suffix and Height arrays.
- Enumeration of substrings and their occurrences is done in linear time by simulating the DFS of the "virtual" suffix tree while scanning the Height array.
- Reconstruction (restriction) of the suffix and height arrays can be done with O(n log n) integer sorting and O(1)-time LCA/range-minima computation (Farach-Colton, Ferragina, Muthukrishnan '00).

T. Kasai, G. Lee, H. Arimura, S. Arikawa, K.S. Park, "Linear-time longest-common-prefix computation in suffix arrays and its applications", CPM'01; H. Arimura, CPM'01 talk.
28. Implementation: a More Practical Algorithm
- Split-Merge-with-Array algorithm (SMA)
  - A re-implementation of SMT with the Suffix and Height arrays.
  - Has the same time complexity as SMT and slightly improved space complexity on average.
  - Easy to implement and scalable, thanks to a simple data structure that extensively uses one-dimensional arrays together with sorting and mapping operations over them.
- Theorem: For a set of random texts of total size N, the SMA algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(N (log N)^{d+1}) and space O(max(k, d) N).
29. Results
- We developed a linear-time algorithm for the substring traversal problem that simulates the post-order traversal of the suffix tree of a text using the suffix array and the lcp information.
- Advantages:
  - More space efficient than the suffix tree (7n bytes + stack vs. 15n bytes + stack).
  - Faster than the naive simulation using the Pos array alone with binary search.
  - Easy to handle and implement.
  - The linear-time solution of the lcp problem is essential.
- Enables the use of suffix arrays instead of suffix trees in large-scale applications in bioinformatics and data mining.
30. Algorithm
[Figure: a single left-to-right scan of the Height array (2 1 0 3 1 0 2 0 -1) drives a stack of (position, depth) pairs, e.g. (5, 3), (4, 1), (3, 0), (-1, -1), over the Suffix array (4 1 8 2 5 6 3 7 9) of a small example text.]
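The scan sketched in the figure can be written out as follows: a stack of (string depth, leaf count) pairs simulates the post-order traversal of the virtual suffix tree, emitting each internal node exactly once (a sketch; the function name and the (depth, count) output format are illustrative):

```python
def internal_nodes(sa, height):
    """One scan of the height array with a stack, simulating the
    post-order traversal of the virtual suffix tree; emits
    (string depth, number of leaves) for every internal node."""
    out = []
    stack = [(-1, 0)]                          # sentinel (depth, leaves)
    for r in range(1, len(sa) + 1):
        h = height[r] if r < len(sa) else -1   # h = -1 flushes the stack
        leaves = 1                             # the leaf for suffix sa[r-1]
        while stack[-1][0] > h:
            d, c = stack.pop()
            c += leaves
            out.append((d, c))                 # node finished: post-order emit
            leaves = c
        if stack[-1][0] < h:
            stack.append((h, leaves))
        else:
            stack[-1] = (stack[-1][0], stack[-1][1] + leaves)
    return out
```

For "banana" (sa = [5, 3, 1, 0, 4, 2], height = [0, 1, 3, 0, 0, 2]) this reports the nodes for "ana" (depth 3, 2 leaves), "a" (1, 3), "na" (2, 2), and the root (0, 6), so substring frequencies fall out of the scan for free.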
31. Prototype system
- Based on computational geometry techniques.
- Built on a full-text index, the suffix array.
- Virtual traversal technique over the suffix array.
- Space requirement reduced to O(dn) with small constants by the extensive use of the suffix array and ternary quicksort.
- g++ on Solaris 2.6, Sun Ultra SPARC IIi, 250MHz.
32. Running time
Summary for various values of the parameters d and k:
- Data: 15.2MB (SHIP data from the Reuters-21578 dataset)
- Sun Microsystems Ultra SPARC II, 300MHz, 512MB, g++ on Solaris 2.6
- Best 200 patterns with entropy minimization
33. Experiments on Document Browsing
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
34. Application to document browsing
Target category: ship; control: all categories but ship (pos and neg: 2970 / 7887).

Frequency-based mining (rank, pattern):
1 <reuter>  2 <the>  3 <to>  4 <said>  5 <of>  6 <and>  7 <in>  8 <a>  9 <s>  10 <on>  11 <for>  12 <at>  13 <by>  14 <said the>  15 <in the>  16 <with>  17 <from>  18 <of the>  19 <was>  20 <but>

Optimization-based mining (rank, pattern):
1 <shipping>  2 <ships>  3 <gulf>  4 <vessels>  5 <the gulf>  6 <port>  7 <ship>  8 <quwaiti>  9 <iranian>  10 <iran>  11 <in the gulf>  12 <tankers>  13 <cargo>  14 <vessel>  15 <warships>  16 <strike>  17 <attack>  18 <tanker>  19 <flag>  20 <ports>
35. Application to document browsing
Finding the phrases that characterize the articles of the category ship against the articles of all other categories.
36. Optimization-based vs. frequency-based data mining
- Measured: the number of stop words and the number of title words among the phrases found.
- Title words: taken from the <TITLE> section of the Reuters newswires.
- Stop words: taken from the standard stop-word lists for the Brown corpus.
37. Discovery of Important Keywords in the Cyberspace
- Target dataset: the base set for the query "HONDA"
  - Root set S: the best 200 hits returned by AltaVista(TM).
  - Base set T (1,000-5,000 pages): all pages at link distance one from the pages in S
    - forward links: pages linked from S;
    - back links: up to 50 randomly selected pages pointing into S.
- Control dataset: the base set for the query "Softbank"
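The base-set construction above follows the root/base-set expansion familiar from Kleinberg-style link analysis; a sketch, where the function names and the dict-of-lists link maps are illustrative assumptions:

```python
import random

def base_set(root, out_links, in_links, sample=50):
    """Expand the root set S (top query hits) to the base set T: add every
    page S points to, plus up to `sample` randomly chosen pages pointing
    into each page of S."""
    T = set(root)
    for page in root:
        T.update(out_links.get(page, ()))        # forward links from S
        preds = in_links.get(page, [])
        T.update(random.sample(preds, min(sample, len(preds))))  # back links
    return T
```

The random sampling of back links bounds the growth of T, since popular pages can have thousands of in-links.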
38. Frequency-based vs. Optimization-based
Mining patterns in the target/positive dataset (HONDA) using the background/negative dataset (SOFTBANK).
- HONDA and SOFTBANK: an automobile company vs. an internet business.
39. Dependence on the Background/Negative Data
Mining patterns in the target/positive dataset (HONDA) while varying the background/negative dataset (SOFTBANK or TOYOTA).
- HONDA vs. SOFTBANK: an automobile company vs. an internet business.
- HONDA vs. TOYOTA: both are automobile companies.
40. Conclusion
- Text databases
- Optimized pattern discovery
- Proximity phrase association patterns
- Fast and robust text mining algorithms
  - Split-Merge algorithm for finding the optimal patterns
  - Levelwise-Scan algorithm for large disk-resident data
- Applications
  - Interactive document browsing
  - Web mining

Please visit http://www.i.kyushu-u.ac.jp/arim/
41. Semi-structured Data

<ARTICLE status="draft">
  <TITLE> Fast Text Data Mining with optimal pattern discovery </TITLE>
  <AUTHOR> H. Arimura </AUTHOR>
  <AUTHOR> T. KASAI </AUTHOR>
  <AUTHOR> A. WATAKI </AUTHOR>
  <AUTHOR> S. Arikawa </AUTHOR>
  <ABSTRACT> This paper considers the efficient discovery of a simple class of patterns from large text databases. </ABSTRACT>
  <SECTION>
    <TITLE> Introduction </TITLE>
    <BODY> Recent progress of network and storage technology enables us to collect and accumulate ... </BODY>
  </SECTION>
  <SECTION>
    <TITLE> Preliminaries </TITLE>
    <BODY> In this section, we give basic definitions and results on ... </BODY>
  </SECTION>
  ...
</ARTICLE>

Web / XML data
42. Theoretical results
- Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time O(k^2 b^k N).
  (Note: a straightforward algorithm has super-linear time complexity when the number of labels grows with N.)
- Theorem: If the maximum size k of the patterns is unbounded, then for any ε > 0 there exists no polynomial-time (770/767 - ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size over an unbounded label alphabet, unless P = NP.

Proc. SIAM Data Mining (SDM'02), 2002, and Proc. PKDD'02, 2002.
43. Mining Frequent Tree Patterns
[Figure: a tree pattern π matched against a collection of data trees T1, T2, T3, T4.]
45. Efficient Substructure Discovery from Large Semi-structured Data
- Asai, Abe, Kawasoe, Arimura, Sakamoto, Arikawa
- Proc. 2nd SIAM International Conference on Data Mining (SDM'02), Arlington, April 2002. (To appear)
- Also presented at the DE workshop (Oct.), SIG-FAI/KBS (Nov.), and DEWS'02.
46. Correctness of FREQT
- Theorem: Given a minimum support threshold 0 < σ ≤ 1, the algorithm Find-Freq-Trees finds all the tree patterns whose frequency in the data is at least σ.
- An extension to optimized pattern discovery over semi-structured data appears in PKDD'02 (Aug. 2002).
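FREQT's candidate generation is the rightmost-expansion technique: a labeled ordered tree of size n+1 is produced exactly once by attaching one new node on the rightmost path of a tree of size n, so the pattern space is searched without duplicates. A sketch over a (depth, label) preorder encoding (the encoding choice here is illustrative):

```python
def rightmost_expansions(tree, labels):
    """One step of FREQT-style candidate generation.  A labeled ordered
    tree is encoded as its preorder list of (depth, label); every larger
    tree arises exactly once by appending a new rightmost node whose depth
    is between 1 and (depth of the current rightmost leaf) + 1."""
    if not tree:
        for lab in labels:
            yield [(0, lab)]          # the single-node trees
        return
    last_depth = tree[-1][0]
    for d in range(1, last_depth + 2):
        for lab in labels:
            yield tree + [(d, lab)]
```

Starting from a single root and expanding repeatedly enumerates every labeled ordered tree without duplicates; counting each candidate's frequency against the data then prunes the search, as in Find-Freq-Trees.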