Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace

1
Efficient Text and Semi-structured Data Mining
Towards Knowledge Discovery in the Cyberspace
  • Hiroki Arimura
  • Department of Informatics, Kyushu University,
    Japan

Joint work with Tatsuya Asai, Shinji Kawasoe,
Kenji Abe, Junichiro Abe, Ryoichi Fujino, Hiroshi
Sakamoto, Setsuo Arikawa (Kyushu Univ.), Shinichi
Shimozono (Kyutech).
Supported by Grant-in-Aid for Scientific Research
on Priority Area "Discovery Science" and by
"Informatics", Japan Science and Technology
Corp., PRESTO.
2
Outline
  • Efficient Text Data Mining
  • Fast and Robust Text Mining Algorithm (ALT'98,
    ISAAC'98, DS'98)
  • Efficient Text Index for Data Mining (CPM'01,
    CPM'02)
  • Text Mining on External Storage (PAKDD'00)
  • Applications
  • Interactive Document Browsing
  • Keyword Discovery from the Web
  • Towards Semi-structured Data Mining
  • Efficient Frequent Tree Miner (SDM'02, PKDD'02)
  • Mining Semi-structured Data Streams (ICDM'02)
  • Information Extraction from the Web (GI'00,
    FLAIRS'01)
  • Conclusion

3
Efficient Text Data Mining with Optimized Pattern
Discovery
Joint work with Junichiro Abe, Ryoichi Fujino,
Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ),
Shinichi Shimozono (Kyutech)
4
Large Text Databases
  • have emerged since the early 1990s with the rapid
    progress of computer and network technologies.
  • Web pages (OPENTEXT Index, GBs to TBs)
  • Collections of XML / SGML documents
  • Genome databases (GenBank, PIR)
  • Bibliographic databases (INSPEC, ResearchIndex)
  • E-mails or plain texts on a file system
  • Huge, heterogeneous, unstructured data
  • Traditional database and data mining technology
    cannot handle them!

5
Our Research Goal
  • Text mining as an inverse of IR
  • Develop a new access tool for text data that
    interactively supports human discovery from large
    text collections
  • Key: fast and robust text mining methods

6
Browsing a large collection of documents with
unknown vocabulary and structure
<dallers>
<gulf >
<wheat>
<shipping >
<us.>
<sea men>
<strike >
<port >
<ships>
<the gulf >
<vessels >
<iranian >
<attack >
<silk worm missile>
<iran >
<strike >
Reuters-21578: 21,578 articles from Reuters
newswires from Feb. to Oct. 1987 on economy
and international affairs
8
Proximity Word Association Patterns
  • Association rules over arbitrary subwords.
  • Ordered: the subwords must occur in a fixed order
  • Proximity: the distance between consecutive
    subwords is within a constant k (the proximity)

[Figure: subwords CACA, AGGAT, CCAA shown against the
texts GTGTCACATGTTTCTGTAGAAAGAGGTCCA and
GTGTCACAAATTCTGTAGTATCA]
Parameters: the maximum number of substrings d and
the proximity k
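
A minimal sketch of how such a pattern could be matched against a text, written in Python for illustration only. The function names and the choice to measure distance between the start positions of consecutive subwords are assumptions; the slides do not fix these details.

```python
def occurrences(text, word):
    """Return all start positions of word in text (overlaps allowed)."""
    positions, start = [], text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + 1)
    return positions


def matches(text, words, k):
    """True if the subwords occur in the given order and the start
    positions of consecutive subwords are at most k apart (assumed
    distance measure)."""
    # Start positions at which the first subword can be placed.
    reachable = occurrences(text, words[0])
    for word in words[1:]:
        # Keep only occurrences that follow a reachable occurrence
        # of the previous subword within distance k.
        reachable = [q for q in occurrences(text, word)
                     if any(0 < q - p <= k for p in reachable)]
        if not reachable:
            return False
    return bool(reachable)


text = "GTGTCACATGTTTCTGTAGAAAGAGGTCCA"
print(matches(text, ["CACA", "AGAGG", "CCA"], k=20))
print(matches(text, ["CACA", "AGAGG", "CCA"], k=10))
```

Here d is simply `len(words)`; a smaller k rejects the same subwords when they are too far apart.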
9
Related Research
  • Feldman and Dagan (KDD'96)
  • Association rules over keywords: Arab, Egypt,
    Iran => Oil
  • Using the Apriori-style algorithm of Agrawal et al.
    (1994)
  • Motwani (SIGMOD'97)
  • Correlations over keywords
  • Mannila and Toivonen (KDD'96)
  • Episode patterns (partially ordered sets of
    events)
  • Wang, Chirn, Marr, Shapiro, Shasha, Zhang
    (SIGMOD'94)
  • Word association patterns without proximity:
    AGAGTATA AGAT
  • A generate-and-test algorithm + heuristics
  • Implementation for d = 2, or d = 1 + approximate
    matching
  • Iliopoulos, Makris, Sioutas, Tsakalidis, Tsichlas
    (CPM'02, this conference!)
  • Model Identification Problem for maximal pairs of
    strings (2-dim)
  • Common or frequent pattern discovery for d = 2
    and proximity

10
Goal: to find those patterns that characterize
the target collection
What properties separate the target data from
the rest of the data?
11
Optimized Rule/Pattern Discovery
  • Data mining: optimized data mining (IBM DM
    group, 1996 - 2000)
  • Learning theory: agnostic PAC learning (1990s)
  • Statistics: Vapnik-Chervonenkis theory (1970s)

Impurity function $\psi(p)$
  • Minimization of
  • Prediction error
  • Information entropy
  • Gini index

$p$: ratio of positives that a pattern matches
12
Optimized Rule/Pattern Discovery
Goodness of a pattern = goodness of the split by
the pattern = weighted average of the values of the
impurity function at the matched and unmatched sets
13
Optimized Rule/Pattern Discovery
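Evaluation function for a pattern $\pi$:
$G_{S,\xi}(\pi) = (N_1/N)\,\psi(M_1/N_1) + (N_0/N)\,\psi(M_0/N_0)$
where $N_1$ and $N_0$ are the numbers of matched and
unmatched documents, and $M_1$ and $M_0$ are the numbers
of positives among them.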
14
Optimal Pattern Discovery Problem
  • Given: a set $S$ of documents and an objective
    function $\xi : S \to \{0, 1\}$.
  • Problem: find an optimal pattern $\pi \in P$ that
    minimizes the evaluation function
  • $G_{S,\xi}(\pi) = (N_1/N)\,\psi(M_1/N_1) + (N_0/N)\,\psi(M_0/N_0)$
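
The evaluation function can be made concrete with a short sketch. The Python below is an illustration rather than the authors' code: the helper names are hypothetical, the Gini index is taken as 2p(1-p), and an empty matched or unmatched side is assumed to contribute zero.

```python
import math


def prediction_error(p):
    """Error of predicting the majority class when a fraction p is positive."""
    return min(p, 1.0 - p)


def entropy(p):
    """Binary information entropy in bits; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def gini(p):
    """Gini index for two classes (assumed form 2p(1-p))."""
    return 2.0 * p * (1.0 - p)


def split_score(matched, labels, impurity):
    """(N1/N) * impurity(M1/N1) + (N0/N) * impurity(M0/N0).

    matched[i] says whether the pattern matches document i,
    labels[i] is its 0/1 objective value.
    """
    n = len(labels)
    n1 = sum(1 for m in matched if m)
    m1 = sum(1 for m, y in zip(matched, labels) if m and y)
    n0 = n - n1
    m0 = sum(labels) - m1
    score = 0.0
    if n1:  # an empty side contributes nothing (assumption)
        score += n1 / n * impurity(m1 / n1)
    if n0:
        score += n0 / n * impurity(m0 / n0)
    return score


labels = [1, 1, 0, 0]
print(split_score([True, True, False, False], labels, entropy))
print(split_score([True, False, True, False], labels, entropy))
```

A pattern that separates positives from negatives perfectly scores 0 under all three impurity functions; an uninformative pattern scores highest.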

15
Relation to Robust Probabilistic Learning
  • Statistical decision theory in the 70s
  • Vapnik-Chervonenkis theory (1970s)
  • Computational learning theory in the 90s
  • Agnostic PAC-learning / robust trainability
    (Kearns et al. '92)
  • An algorithm that efficiently solves the
    classification error minimization problem is an
    efficient robust learner; that is, it can
    approximate an arbitrary distribution that
    generates the examples, from the viewpoint of
    classification (Haussler 1990).
  • Intractable in general (Kearns et al. 1992)
  • Empirical machine learning in the 90s
  • The power of simple rules + rigorous optimization
    (Weiss, Holte)
  • Data mining + COLT in the mid-90s
  • Efficient algorithms for simple geometric patterns

16
Application to Text Mining
  • Traditional method (Frequency-based)
  • Finding most frequent patterns in the target set
    T.
  • Many trivial patterns (stop-words) may hide less
    frequent interesting patterns
  • Traditional stop-word elimination in IR may not
    work

[Figure: frequency over the vocabulary of the target
dataset. Stop words (the, a, an, that, of, with) are the
most frequent; phrases such as "Iranian oil platform",
"quwaiti tanker", "Silkworm missile", and "attack" are
less frequent.]
17
Application to Text Mining
  • Optimized Data Mining
  • Finding optimal patterns
  • Uses an average dataset B of documents as a
    control dataset for canceling trivial patterns
  • Finds those patterns that appear more frequently
    in the target set T and less frequently in the
    control set B.
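
The contrast between the two rankings can be sketched on toy data. This Python example is only an illustration under stated assumptions: documents are split on whitespace, single words stand in for patterns, the documents are invented, and the function name `rank_words` is hypothetical.

```python
import math


def entropy(p):
    """Binary information entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rank_words(target, control):
    """Rank words by target frequency and by entropy of the split."""
    docs = [set(text.split()) for text in target + control]
    vocab = sorted(set().union(*docs))
    n, positives = len(docs), len(target)

    def target_frequency(word):
        # Number of target documents containing the word.
        return sum(word in doc for doc in docs[:positives])

    def split_entropy(word):
        # Weighted entropy of the matched and unmatched sets.
        n1 = sum(word in doc for doc in docs)
        m1 = target_frequency(word)
        n0, m0 = n - n1, positives - m1
        score = 0.0
        if n1:
            score += n1 / n * entropy(m1 / n1)
        if n0:
            score += n0 / n * entropy(m0 / n0)
        return score

    by_frequency = sorted(vocab, key=lambda w: -target_frequency(w))
    by_entropy = sorted(vocab, key=split_entropy)
    return by_frequency, by_entropy


target = ["the tanker in the gulf", "the gulf attack", "the port strike"]
control = ["the price of oil", "the bank said"]
by_frequency, by_entropy = rank_words(target, control)
print(by_frequency[:3])
print(by_entropy[:3])
```

On this toy input the stop word "the" tops the frequency ranking but comes last under the entropy score, because it occurs equally often in the control set. The score is symmetric, so a word found only in the control set can also rank well; a practical system would additionally check the direction of the association.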

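[Figure: frequency over the vocabulary in the target
dataset and in the background dataset. Stop words (the,
a, an, that, of, with) are frequent in both; phrases such
as "Iranian oil platform", "quwaiti tanker", "Silkworm
missile", and "attack" stand out in the target dataset.]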
19
Straightforward algorithm (case: the number d of
substrings is bounded)
  • Procedure
  • Enumerate all $O(n^{2d})$ proximity patterns built
    from the $O(n^2)$ subwords of the text.
  • For each pattern p, compute the score in linear
    time.
  • The straightforward algorithm requires $O(n^{2d+1})$
    time and is too slow to apply to real-world
    databases.
  • We require more efficient algorithms that run in
    time, say, O(n) to O(n log n) on real datasets.
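
For reference, the straightforward procedure can be written down directly. The following Python sketch is a deliberately naive illustration of the baseline, usable only on tiny inputs; the function names, the start-position distance measure, and the use of prediction error as the score are assumptions.

```python
from itertools import product


def substrings(text):
    """All non-empty substrings of text (O(n^2) of them)."""
    return {text[i:j] for i in range(len(text))
            for j in range(i + 1, len(text) + 1)}


def matches(text, words, k):
    """Subwords occur in order, consecutive starts at most k apart."""
    starts = [i for i in range(len(text)) if text.startswith(words[0], i)]
    for w in words[1:]:
        starts = [q for q in range(len(text)) if text.startswith(w, q)
                  and any(0 < q - p <= k for p in starts)]
    return bool(starts)


def prediction_error(docs, labels, words, k):
    """Fraction of documents misclassified when the matched and the
    unmatched side each predict their majority class."""
    matched = [matches(t, words, k) for t in docs]
    n1 = sum(matched)
    m1 = sum(1 for m, y in zip(matched, labels) if m and y)
    n0, m0 = len(docs) - n1, sum(labels) - m1
    return (min(m1, n1 - m1) + min(m0, n0 - m0)) / len(docs)


def brute_force(docs, labels, d, k):
    """Try every d-tuple of substrings and keep the best one."""
    candidates = sorted(set().union(*(substrings(t) for t in docs)))
    best = min(product(candidates, repeat=d),
               key=lambda ws: prediction_error(docs, labels, ws, k))
    return best, prediction_error(docs, labels, best, k)


docs = ["abab", "abba", "bbbb", "baaa"]
labels = [1, 1, 0, 0]
print(brute_force(docs, labels, d=1, k=2))
```

The number of candidate tuples grows as the d-th power of the number of substrings, which is why this baseline does not scale.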

20
Theoretical result (positive): the number d of
substrings is bounded
  • Theorem: For a set of random texts of total size
    N, the Split-Merge algorithm finds all the
    k-proximity d-word association patterns that
    minimize the prediction error in average time
    $O(k^{d} (\log N)^{d+1} N)$ and space $O(\max(k, d)\, N)$.

A large constant in practice:
d = 2 to 4, k = 2 to 8 (words), log N = 10 to 20
(Reuters-21578 collection of 15.6MB)
Proc. ISAAC'98, LNCS 1533, 1998; New Generation
Computing, 2000
21
Theoretical result (negative): the number d of
substrings is unbounded
  • Theorem: If the number d of subwords is
    unbounded, then there is no polynomial-time
    approximation algorithm that solves the optimal
    pattern problem above within an arbitrarily small
    approximation ratio, assuming P ≠ NP (MAX SNP-hard).
Details of Algorithm
22
Suffix tree and array
Data structures for efficiently storing all of the
$O(n^2)$ substrings in O(n) space
[Figure: suffix tree of an example text over {a, b, c}
with positions 1 to 9]
  • Suffix tree (McCreight, 1976)
  • represents all the substrings in O(n) space.
  • Problems
  • Not space efficient.
  • Dynamic reconstruction is not easy.
  • Not suitable for implementation on secondary
    storage.
23
  • Basic idea
  • Reducing the best k-proximity d-word association
    pattern
  • to the best d-dim box over the rank space

Translation by suffix array
  • The position space consists of all possible
    pairs of positions that differ by at most k.
  • The rank space consists of all pairs of the
    suffixes of the text ordered lexicographically.
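
The translation from the position space to the rank space can be sketched for d = 2. The Python below is an illustrative toy, not the authors' implementation: it builds the suffix array by naive sorting, and all function names are hypothetical. The key fact it relies on is that the suffixes starting with a given subword occupy a contiguous interval of ranks, so a pair of subwords corresponds to an axis-parallel box.

```python
def suffix_array(text):
    """Start positions of all suffixes in lexicographic order
    (naive construction, for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def rank_of_positions(sa):
    """Inverse of the suffix array: rank[pos] = lexicographic rank."""
    rank = [0] * len(sa)
    for r, pos in enumerate(sa):
        rank[pos] = r
    return rank


def rank_interval(text, sa, word):
    """Contiguous range of ranks whose suffixes start with word."""
    hits = [r for r, pos in enumerate(sa) if text.startswith(word, pos)]
    return (hits[0], hits[-1]) if hits else None


def rank_points(text, k):
    """Map every position pair (p, q) with 0 < q - p <= k to the
    point (rank[p], rank[q]) in the rank space."""
    rank = rank_of_positions(suffix_array(text))
    n = len(text)
    return [(rank[p], rank[q]) for p in range(n)
            for q in range(p + 1, min(p + k, n - 1) + 1)]


def count_in_box(points, box_x, box_y):
    """Number of points inside the box box_x by box_y (inclusive)."""
    return sum(box_x[0] <= x <= box_x[1] and box_y[0] <= y <= box_y[1]
               for x, y in points)


text = "abab"
sa = suffix_array(text)
box = rank_interval(text, sa, "ab"), rank_interval(text, sa, "b")
print(count_in_box(rank_points(text, k=1), *box))
```

Counting the points inside the box gives the number of occurrences of the pattern ("ab" followed by "b" within distance 1), here at positions 0 and 2.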
24
An $O(k^{d} (\log N)^{d+1} N)$-time Algorithm
  • Improvement of a generate-and-test algorithm
  • Using the d-dim orthogonal range tree structure

Two-dimensional case
25
An $O(N^2 \log^2 N)$-time Algorithm in the
two-dimensional case (slow)
  • Improvement of a generate-and-test algorithm
  • Direct use of the d-dim orthogonal range tree
    structure
  • $O(N \log^2 N)$ space / preprocessing and $O(\log^2 N)$
    time per query
  • Only a slight improvement: $O(N^2 \log^2 N) < O(N^3)$
  • Impractical: too slow and space consuming

Mean height: O(log N)
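
To make the per-query cost concrete, here is a small two-dimensional range-counting structure in Python. It is a sketch of the general technique (a balanced tree over x, with a sorted list of y values at every node), not the structure used in the Split-Merge algorithm; the class name and interface are assumptions. A query visits O(log N) nodes and does a binary search in each.

```python
from bisect import bisect_left, bisect_right


class RangeTree2D:
    """Static 2-d orthogonal range counting."""

    def __init__(self, points):
        self.pts = sorted(points)
        self.xs = [x for x, _ in self.pts]
        self.n = len(self.pts)
        self.ys = {}  # node id -> sorted y values of its x-range
        if self.n:
            self._build(1, 0, self.n)

    def _build(self, node, lo, hi):
        # Node covers the points with x-sorted index in [lo, hi).
        self.ys[node] = sorted(y for _, y in self.pts[lo:hi])
        if hi - lo > 1:
            mid = (lo + hi) // 2
            self._build(2 * node, lo, mid)
            self._build(2 * node + 1, mid, hi)

    def count(self, x1, x2, y1, y2):
        """Number of points with x1 <= x <= x2 and y1 <= y <= y2."""
        qlo = bisect_left(self.xs, x1)
        qhi = bisect_right(self.xs, x2)
        return self._count(1, 0, self.n, qlo, qhi, y1, y2)

    def _count(self, node, lo, hi, qlo, qhi, y1, y2):
        if qlo >= hi or qhi <= lo or qlo >= qhi:
            return 0  # no overlap with the query's x-range
        if qlo <= lo and hi <= qhi:
            ys = self.ys[node]  # fully covered: binary search on y
            return bisect_right(ys, y2) - bisect_left(ys, y1)
        mid = (lo + hi) // 2
        return (self._count(2 * node, lo, mid, qlo, qhi, y1, y2)
                + self._count(2 * node + 1, mid, hi, qlo, qhi, y1, y2))


tree = RangeTree2D([(1, 3), (3, 0), (0, 2), (2, 2)])
print(tree.count(0, 1, 2, 3))
```

Issuing one such query for each of the quadratically many candidate boxes is what makes the direct approach slow.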
26
Problems in Implementation
  • Implementation of the Split-Merge algorithm
  • Maintaining the suffix tree in memory.
  • The algorithm is too complex and requires a huge
    amount of memory.
  • Applicable only to small datasets (50KB) on a
    workstation with 256MB of memory (Proc. 9th
    Algorithmic Learning Theory, 1998).
  • Difficult to apply to large text data, such as
    large applications in bioinformatics and in
    text and Web mining

27
Implementation: From Trees to Arrays
  • Efficient full-text index for text mining.
  • Replacing the tree with one-dimensional arrays
  • Most operations in the Split-Merge algorithm can be
    efficiently implemented with the suffix + height
    arrays.
  • Enumeration of substrings and their occurrences is
    done in linear time by simulating the DFS of the
    "virtual" suffix tree while scanning the height
    array.
  • Reconstruction (restriction) of the suffix and
    the height arrays can be done with O(n log n)
    integer sorting and O(1)-time LCA/range-minima
    computation (Farach-Colton, Ferragina,
    Muthukrishnan '00).

T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. S.
Park, "Linear-time longest-common-prefix
computation in suffix arrays and its
applications", CPM'01; H. Arimura, CPM'01 talk.
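
The linear-time longest-common-prefix computation cited above can be sketched as follows. This Python version follows the well-known scheme of scanning the text positions in order and reusing the previous match length minus one; the function name and the array convention (0-based positions, height[r] is the common prefix length of the suffixes at ranks r-1 and r, height[0] = 0) are assumptions and may differ from the convention used on the slides.

```python
def height_array(text, sa):
    """Longest-common-prefix (height) array of a suffix array.

    height[r] is the length of the longest common prefix of the
    suffixes at ranks r-1 and r; height[0] is 0 by convention.
    """
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    height = [0] * n
    h = 0  # match length carried over from the previous position
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in rank order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        height[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return height


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa, height_array(text, sa))
```

Because h decreases by at most one per step and never exceeds n, the total number of character comparisons is linear in the text length.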
28
Implementation More Practical Algorithm
  • Split-Merge-with-Array algorithm (SMA)
  • Re-implementation of SMT with the suffix + height
    arrays
  • has the same time complexity as SMT and slightly
    improved space complexity on average.
  • Easy to implement and scalable due to a simple
    data structure that makes extensive use of
    one-dimensional arrays and of sorting and mapping
    operations over them.
  • Theorem: For a set of random texts of total size
    N, the SMA algorithm finds all the k-proximity
    d-word association patterns that minimize the
    prediction error in average time $O(N (\log N)^{d+1})$
    and space $O(\max(k, d)\, N)$.

29
Results
  • We develop a linear-time algorithm for the
    substring traversal problem that simulates the
    post-order traversal of the suffix tree of a text
    using the suffix array and the lcp-information.
  • Advantages
  • More space efficient than the suffix tree (7n bytes
    + stack vs. 15n bytes + stack).
  • Faster than the naive simulation with Pos alone
    using binary search.
  • Easy to handle and implement.
  • The linear-time solution for the lcp-problem is
    essential.
  • Possible use of suffix arrays instead of suffix
    trees in large-scale applications in
    bioinformatics and data mining.
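
The traversal can be illustrated with a stack-based scan of the height array. The Python sketch below is an assumption-laden illustration, not the algorithm from the paper: it uses 0-based positions, takes height[r] as the common prefix length of the suffixes at ranks r-1 and r, and reports each branching substring (an internal node of the virtual suffix tree) with its occurrences, children before parents.

```python
def branching_substrings(text, sa, height):
    """Report (substring, sorted occurrences) for every internal node
    of the virtual suffix tree, in post-order, using only the suffix
    array and the height array."""
    n = len(sa)
    found = []
    stack = [(0, 0)]  # (common prefix length, leftmost rank)
    for r in range(1, n + 1):
        h = height[r] if r < n else -1  # -1 flushes the whole stack
        left = r - 1
        while stack and stack[-1][0] > h:
            # The node on top ends at rank r-1: report it.
            lcp, left = stack.pop()
            word = text[sa[left]:sa[left] + lcp]
            found.append((word, sorted(sa[left:r])))
        if r < n and stack[-1][0] < h:
            # A deeper node starts at the leftmost rank seen so far.
            stack.append((h, left))
    return found


text = "banana"
sa = [5, 3, 1, 0, 4, 2]
height = [0, 1, 3, 0, 0, 2]
for word, occ in branching_substrings(text, sa, height):
    print(repr(word), occ)
```

Each rank is pushed and popped at most once, so the scan itself is linear; sorting the occurrence lists here is only for readable output.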

30
Algorithm
[Figure: simulating the suffix tree traversal with a
stack over the suffix array and the height array]
Suffix array: 4 1 8 2 5 6 3 7 9
Height array: 2 1 0 3 1 0 2 0 -1
Stack: (5, 3) (4, 1) (3, 0) (-1, -1)
31
Prototype system
  • Based on computational geometry techniques
  • Built on a full text index called the suffix
    array
  • Virtual traversal technique over suffix array.
  • Space requirement is reduced to O(dn) with small
    constants by the extensive use of the suffix array
    and ternary quicksort.
  • g++ on Solaris 2.6, Sun Ultra SPARC IIi, 250MHz.

32
Running time
Summary for various values of parameters d and k
  • Data: 15.2MB (SHIP data from the Reuters-21578 data)
  • Sun Microsystems Ultra SPARC II 300MHz, 512MB, g++
    on Solaris 2.6.
  • Best 200 patterns with entropy minimization

33
Experiments on Document Browsing
Arimura, Abe, Fujino, Sakamoto, Shimozono,
Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital
Library, 2000.
34
Application to document browsing
Target: category ship. Control: all categories
but ship.
pos and neg: 2970, 7887

Rank  Frequency-based mining  Optimization-based mining
----  ----------------------  -------------------------
1     <reuter >               <shipping >
2     <the >                  <ships >
3     <to >                   <gulf >
4     <said >                 <vessels >
5     <of >                   <the gulf >
6     <and >                  <port >
7     <in >                   <ship >
8     <a >                    <quwaiti >
9     <s >                    <iranian >
10    <on >                   <iran >
11    <for >                  <in the gulf >
12    <at >                   <tankers >
13    <by >                   <cargo>
14    <said the >             <vessel>
15    <in the >               <warships >
16    <with >                 <strike >
17    <from >                 <attack >
18    <of the >               <tanker >
19    <was >                  <flag >
20    <but >                  <ports >
35
Application to document browsing
Finding the phrases that characterize the
articles of category ship from the articles with
other categories.
36
Optimization based data mining vs. Frequency
based data mining
[Figure: stop words and title words among the phrases
found by each method]
  • Title words: from the <TITLE> section of Reuters
    newswires
  • Stop words: from the standard stopword lists for
    the Brown corpus
  • Measuring the ratio of title/stop words in the
    phrases found.

37
Discovery of Important Keywords in the Cyberspace
Target dataset: the base set for the query
"HONDA"
  • Root set S: best 200 hits by AltaVista(TM)
  • Base set T: all pages of distance one from pages
    in S (back links pointing to S and forward links
    from S), 1,000 to 5,000 pages
  • Randomly selected 50 pages
Control dataset: the base set for the query
"Softbank"
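
The construction of a base set from a root set can be sketched as a one-step expansion over a link graph. This Python example is hypothetical throughout: the link graph is a plain dictionary, the function name and parameters are invented, and the choice to sample at most 50 back-linking pages per root page is an assumption about how the "randomly selected 50 pages" are used.

```python
import random


def base_set(root, out_links, max_back=50, seed=0):
    """Expand a root set by one link step in both directions.

    root: iterable of page ids (e.g. the top hits for a query).
    out_links: dict mapping a page id to the set of pages it links to.
    max_back: cap on sampled back-linking pages per root page (assumed).
    """
    rng = random.Random(seed)
    base = set(root)
    for page in root:
        # Forward links from the root page.
        base |= out_links.get(page, set())
        # Back links: pages that point to the root page.
        back = sorted(p for p, targets in out_links.items()
                      if page in targets)
        if len(back) > max_back:
            back = rng.sample(back, max_back)
        base |= set(back)
    return base


links = {"s1": {"a", "b"}, "x": {"s1"}, "y": {"s1"}, "z": {"q"}}
print(sorted(base_set(["s1"], links)))
```

Running the same expansion for a second query yields the control dataset against which the target pages are compared.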
38
Frequency-based vs. Optimization-based
Mining patterns in the target/positive dataset
(HONDA data) using the background/negative dataset
(SOFTBANK)
Automobile company and Internet business
Both are automobile companies
39
Dependence on Background/Negative Data
Mining patterns in the target/positive dataset
(HONDA data) while varying the background/negative
dataset (SOFTBANK and TOYOTA)
Automobile company and Internet business
Both are automobile companies
40
Conclusion
  • Text databases
  • Optimized Pattern Discovery
  • Proximity phrase association patterns
  • Fast and robust text mining algorithms
  • Split-Merge algorithm for finding the optimal
    patterns
  • Levelwise-Scan algorithm for large disk-resident
    data.
  • Applications
  • Interactive Document browsing
  • Web Mining

Please visit http://www.i.kyushu-u.ac.jp/~arim/
41
Semi-structured Data
<ARTICLE status=draft>
  <TITLE> Fast Text Data Mining with optimal pattern discovery </TITLE>
  <AUTHOR> H. Arimura </AUTHOR>
  <AUTHOR> T. KASAI </AUTHOR>
  <AUTHOR> A. WATAKI </AUTHOR>
  <AUTHOR> S. Arikawa </AUTHOR>
  <ABSTRACT> This paper considers the efficient discovery of a simple class of patterns from large text databases. </ABSTRACT>
  <SECTION>
    <TITLE> Introduction </TITLE>
    <BODY> Recent progress of network and storage technology enables us to collect and accumulate ... </BODY>
  </SECTION>
  <SECTION>
    <TITLE> Preliminaries </TITLE>
    <BODY> In this section, we give basic definitions and results on ... </BODY>
  </SECTION>
  ...
</ARTICLE>
Web + XML data
42
Theoretical results
  • Theorem: The algorithm OPTT solves the maximum
    agreement problem for labeled ordered trees in
    average time $O(k^k b^k N)$.
  • (Note: a straightforward algorithm has
    super-linear time complexity when the number of
    labels grows with N.)
  • Theorem: If the maximum size k of subwords is
    unbounded, then for any $\varepsilon > 0$ there exists
    no polynomial-time $(770/767 - \varepsilon)$-approximation
    algorithm for the maximum agreement problem for
    labeled ordered trees of unbounded size on an
    unbounded label alphabet if P ≠ NP.

Proc. SIAM Data Mining 02 (2001), and Proc.
PKDD'02 (2002)
43
[Figure: trees T1, T2, T3, T4]
45
  • Efficient Substructure Discovery from Large
    Semi-structured Data
  • Asai, Abe, Kawasoe, Arimura, Sakamoto, Arikawa
  • Proc. 2nd SIAM International Conference on Data
    Mining (SDM'02), Arlington, April 2002. (To
    appear)
  • DE (10), SIG-FAI/KBS (11), DEWS '02.
46
FREQT
  • Find-Freq-Trees, $0 < s \le 1$
  • (PKDD'02, Aug 2002)
  • NP