Title: Granulating the Semantics Space of Web Documents
1Granulating the Semantics Space of Web Documents
- Tsau Young (T. Y.) Lin
- Computer Science Department, San Jose State University
- San Jose, CA 95192-0249, USA
- tylin_at_cs.sjsu.edu
- and
- I-Jen Chiang
- Graduate Institute of Medical Informatics,
- Taipei Medical University, Taipei, Taiwan 110
- ijchiang_at_tmu.edu.tw
2 Main results
- A set of documents is associated with a matrix, called the Latent Semantic Index (LSI). Treating the row vectors as points in Euclidean space, the documents are clustered (categorized) into a polyhedron; the association is believed to be one-to-one.
- Corollary: A set of English documents and their Chinese translations can be identified via their semantics automatically.
3 Main results
- A set of documents is associated with a polyhedron; the association is believed to be near one-to-one.
- Corollary: A set of English documents and their Chinese translations can be identified via their semantics automatically.
4 Main results
- The identification is by semantics, as there is no explicit correspondence between the two sets of documents.
5 Outline
- 1. Introduction
- Domain: Information Ocean
- Methodology: Granular Computing
- Results
- 2. Intuitive View of Granular Computing
- 3. A Formal Theory
- 4.
6 Current State
- Current search engines are syntax-based systems; they often return many meaningless web pages.
- Cause: inadequate semantic analysis, and lack of semantics-based organization of the information ocean.
7 Information Ocean
- The Internet is an information ocean.
- It needs a methodology to navigate.
- A new methodology: Granular Computing
8 Granular Computing: a methodology
- The term "granular computing" was first used to label a subset of Zadeh's granular mathematics, as my research area in BISC, 1996-97.
- (Zadeh, L.A. (1998) Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, Soft Computing, 2, 23-25.)
9 Granular Computing
- Since then, it has grown into an active research area: books, sessions, workshops.
- (Zhong and Lin held the first independent conference using the name GrC; there have been several sessions in JCIS.)
- IEEE task force
10 Granular Computing
- Granulation seems to be a natural problem-solving methodology deeply rooted in human thinking.
- The human body has been granulated into head, neck, etc.
11 Granulating the Information Ocean
- In this talk, we will explain how we granulate the semantic space of the information ocean, which consists of millions of web pages.
12 Organizing the Information Ocean
- How to organize the information ocean?
- By considering the semantics space
13 Latent Semantic Space
- A set of documents/web pages carries certain human thoughts. We will call the totality of these thoughts the Latent Semantic Space (LSS).
- (Recall the Latent Semantic Index (LSI).)
14 Classification vs. Clustering
- In data mining,
- classification means identifying an unseen object with one of the known classes in a partition.
- Clustering means grouping a set of objects into disjoint classes based on similarity, distance, etc.; the key ingredient is that the classes are not known a priori.
15 Categorizing Information
- Multiple concepts can simultaneously exist in a single web page, so to organize web pages a powerful clustering method is needed.
- (The number of concepts cannot be known a priori.)
16 Latent Semantic Space (LSS)
- The simplest representations of LSS?
- A set of keywords
- LSI
17 Latent Semantic Index

        Key1      Key2      ...   KeyN
Doc1    TFIDF11   TFIDF12   ...   TFIDF1N
Doc2    TFIDF21   ...       ...   ...
...
DocM    TFIDFM1   ...       ...   TFIDFMN
18 TFIDF
- Definition 1. Let Tr denote a collection of documents. The significance of a term ti in a document dj in Tr is its TFIDF value, calculated by the function tfidf(ti, dj), which is equivalent to the value tf(ti, dj) · idf(ti, dj). It can be calculated as
- TFIDF(ti, dj) = tf(ti, dj) · log(|Tr| / |Tr(ti)|)
19 TFIDF
- where Tr(ti) denotes the number of documents in Tr in which ti occurs at least once, and
- tf(ti, dj) = 1 + log(N(ti, dj))   if N(ti, dj) > 0
- tf(ti, dj) = 0                    otherwise
- where N(ti, dj) denotes the frequency of term ti in document dj, counting all its non-stop words.
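The TFIDF definition above can be sketched directly in Python; the toy corpus, the tokenization into word lists, and the use of natural logarithm are illustrative assumptions, not part of the slides.

```python
import math

def tf(n):
    # tf = 1 + log(N) if N > 0, else 0, as in the slide's definition
    return 1 + math.log(n) if n > 0 else 0.0

def tfidf(term, doc, corpus):
    # corpus: list of documents, each a list of non-stop words
    n = doc.count(term)                        # N(ti, dj): raw frequency
    df = sum(1 for d in corpus if term in d)   # |Tr(ti)|: document frequency
    if n == 0 or df == 0:
        return 0.0
    return tf(n) * math.log(len(corpus) / df)  # tf * log(|Tr| / |Tr(ti)|)

# Assumed three-document toy corpus:
docs = [["wall", "street", "stock"],
        ["white", "house", "press"],
        ["wall", "paint"]]
print(tfidf("street", docs[0], docs))  # tf = 1, idf = log(3/1)
```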
21 Latent Semantic Index
- Treat each row as a point in Euclidean space. Clustering such a set of points is a common approach (using SVD).
- Note that the points by themselves have very little to do with the semantics of the documents.
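The row-clustering step can be sketched with NumPy: project the TFIDF rows to a low-rank SVD "concept" space, then group rows by cosine similarity. The toy matrix, rank 2, and the 0.9 threshold are all illustrative assumptions.

```python
import numpy as np

# Assumed toy term-document TFIDF matrix: rows = documents, columns = keywords.
A = np.array([[2.0, 1.5, 0.0, 0.0],
              [1.8, 1.2, 0.1, 0.0],
              [0.0, 0.1, 2.2, 1.9],
              [0.0, 0.0, 2.0, 1.7]])

# Rank-2 SVD projection: each row becomes a point in 2-D concept space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
points = U[:, :2] * s[:2]

def cos(a, b):
    # cosine similarity between two projected row vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Greedy single-pass clustering with an assumed 0.9 cosine threshold.
clusters = []
for i, p in enumerate(points):
    for c in clusters:
        if cos(points[c[0]], p) > 0.9:
            c.append(i)
            break
    else:
        clusters.append([i])
print(clusters)  # rows 0,1 group together, rows 2,3 group together
```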
22 Topological Space of LSS
- Euclidean space has many metrics but only one topology.
- We will use this topology.
23 Keywords (0-Associations)
- 1. Given by experts
- 2. A term with high TFIDF is a keyword
- Wall, Door, ..., Street, Ave
24 Keyword Pairs (1-Associations)
- 1-association
- (Wall, Street) → a financial notion
- that has nothing to do with the two vertices, Wall and Street
25 Keyword Pairs (1-Associations)
- 1-association
- (White, House) → a political notion
- that has nothing to do with the two vertices, White and House
26 Keyword Pairs (1-Associations)
- 1-association
- (Neural, Network) → a computing notion
- that has nothing to do with the two vertices, Neural and Network
27 Geometric Analogy: 1-Simplex
- (open) 1-simplex
- (v0, v1) → open segment
- (Wall, Street) → financial notion
- End points (boundaries) are not included
28 Keywords are Abstract Vertices
- LSS of documents/web pages → simplicial complex
- A special hypergraph
- Polyhedron ↔ simplicial complex
29 r-Associations
- Similarly, an r-association represents some semantics generated by a set of r+1 keywords; moreover, the semantics may have nothing to do with the individual keywords.
- There are mathematical structures that reflect such properties; see next.
30 Topology: (Open) Simplexes
- 1-simplex: open segment (v0, v1)
- 2-simplex: open triangle (v0, v1, v2)
- 3-simplex: open tetrahedron (v0, v1, v2, v3)
- Boundaries are not included
31 Topology: (Open) Simplexes
- An (open) r-simplex is the generalization of the low-dimensional simplexes (segment, triangle, and tetrahedron) to the high-dimensional analogue in r-space (Euclidean space of dimension r).
- Theorem. An r-simplex uniquely determines its r+1 linearly independent vertices, and vice versa.
32 Faces
- The convex hull of any m+1 vertices of an r-simplex is called an m-face.
- The 0-faces are the vertices, the 1-faces are the edges, the 2-faces are triangles, and the single r-face is the whole r-simplex itself.
33 Edge
- A line segment where two faces of a polyhedron meet; also called a side.
34 n-Complexes
- A simplicial complex C is a finite set of simplices such that:
- Any face of a simplex from C is also in C.
- The intersection of any two simplices from C is either empty or a face of both of them.
- If the maximal dimension of the constituent simplices is n, then the complex is called an n-complex.
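The two axioms of a simplicial complex can be checked mechanically. This is a minimal sketch with simplices as vertex sets; the keyword vertex names are illustrative.

```python
from itertools import combinations

def is_complex(simplices):
    """Check the two simplicial-complex axioms for a finite family of
    simplices, each given as a set of vertices."""
    fam = {frozenset(s) for s in simplices}
    # Axiom 1: every nonempty proper face of a simplex is in the family.
    for s in fam:
        for r in range(1, len(s)):
            for face in combinations(s, r):
                if frozenset(face) not in fam:
                    return False
    # Axiom 2: any two simplices meet in a common face, or not at all.
    for a in fam:
        for b in fam:
            if a & b and (a & b) not in fam:
                return False
    return True

# (Wall, Street) together with its 0-faces forms a 1-complex:
print(is_complex([{"Wall"}, {"Street"}, {"Wall", "Street"}]))  # True
# The segment alone does not: its endpoint faces are missing.
print(is_complex([{"Wall", "Street"}]))                         # False
```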
35 Upper/Closure Approximations
- Let B(p), p ∈ V, be an elementary granule.
- U(X) = ∪ { B(p) : B(p) ∩ X ≠ ∅ }   (Pawlak)
- C(X) = { p : B(p) ∩ X ≠ ∅ }   (Lin: topology)
36 Upper/Closure Approximations
- Cl(X) = ∪_i C^i(X)   (Sierpiński: topology)
- where C^i(X) = C(C^(i-1)(X))
- (transfinite steps) Cl(X) is closed.
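On a finite universe the iteration of C reaches a fixed point in finitely many steps, so Cl(X) can be computed by plain iteration. The granulation B below is an assumed toy example.

```python
def C(X, B):
    """One step of the upper approximation: all points whose granule meets X."""
    return {p for p in B if B[p] & X}

def closure(X, B):
    """Cl(X) = union of the iterates C^i(X); iterate until a fixed point.
    Finite universe, so no transfinite steps are needed."""
    cur = set(X)
    while True:
        nxt = cur | C(cur, B)
        if nxt == cur:
            return cur
        cur = nxt

# Assumed toy granulation on V = {1, 2, 3, 4}: B(p) is a neighborhood of p.
B = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {4}}
print(sorted(closure({3}, B)))  # 2's granule meets {3}, then 1's meets {2}
```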
37 New View
- Divide (and Conquer)
- Partition of a set → (generalize) →
- Partition of a B-space
- (topological partition)
38 New View: B-space
- The pair (V, B) is the universe, namely
- an object is a pair (p, B(p))
- where B : V → 2^V, p ↦ B(p), is a granulation.
39 Derived Partitions
- The inverse images of B form a partition (an equivalence relation):
- C = { Cp : Cp = B⁻¹(B(p)), p ∈ V }
40 Derived Partitions
- Cp is called the center class of B(p).
- A member of Cp is called a center.
41 Derived Partitions
- The center class Cp consists of all the points that have the same granule:
- Cp = { q : B(q) = B(p) }
42 C-quotient Set
- The set of center classes {Cp} is a quotient set:
- {Iran, Iraq, ...}
- {US, UK, ...}
- {Russia, Korea}
43 New Problem-Solving Paradigm
- (Divide and) Conquer
- Quotient set → (generalize) →
- Topological quotient space
44 Neighborhood of a Center Class
- (in the case B is not reflexive)
- (Diagram: a B-granule/neighborhood covering several C-classes)
46 Topological Partition
- (Diagram: the quotient space of Cp-classes)
51 Topological Table (2-column)

Item     Center class   Label    Binary relation for Column I
US       CX             West     CX × CY (⊆ BX)
UK       CX             West     CX × CZ (⊆ BX)
Iran     CY             M-East   CY × CX (⊆ BY)
Iraq     CY             M-East   CY × CZ (⊆ BY)
Russia   CZ             East     CZ × CX (⊆ BZ)
Korea    CZ             East     CZ × CY (⊆ BZ)
52Future Direction
- Topological Reduct
- Topological Table processing
53 Application 1: CWSP
- In the UK, a financial service company may be consulted by competing companies. Therefore it is vital to have a lawfully enforceable security policy.
54 Background
- Brewer and Nash (BN) proposed the Chinese Wall Security Policy (CWSP) model in 1989 for this purpose.
55 Policy: Simple CWSP (SCWSP)
- "Simple Security": BN asserted that
- "people (agents) are only allowed access to information which is not held to conflict with any other information that they (agents) already possess."
56 A Little Formal
- Simple CWSP (SCWSP):
- No single agent can read data X and Y that are in CONFLICT.
57 Formal SCWSP
- SCWSP says that a system is secure if
- (X, Y) ∈ CIR ⇒ X NDIF Y
- CIR = Conflict of Interests binary Relation
- NDIF = No Direct Information Flow
58 Formal Simple CWSP
- SCWSP says that a system is secure if
- (X, Y) ∈ CIR ⇒ X NDIF Y
- (X, Y) ∉ CIR ⇒ X DIF Y
- CIR = Conflict of Interests binary Relation
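The simple-security condition can be sketched as a direct check: no single agent's read set contains a conflicting pair. CIR is modeled as a set of unordered pairs; the bank names are illustrative assumptions.

```python
def simple_cwsp_secure(cir, reads):
    """Simple CWSP: no single agent directly reads two datasets in conflict.
    cir: set of frozenset pairs; reads: agent -> set of datasets read."""
    for datasets in reads.values():
        for x in datasets:
            for y in datasets:
                if x != y and frozenset((x, y)) in cir:
                    return False
    return True

# Assumed toy conflict-of-interest relation: BankA and BankB compete.
cir = {frozenset(("BankA", "BankB"))}
print(simple_cwsp_secure(cir, {"alice": {"BankA", "OilCo"}}))  # True
print(simple_cwsp_secure(cir, {"bob": {"BankA", "BankB"}}))    # False
```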
59 More Analysis
- SCWSP requires that no single agent can read X and Y,
- but it does not exclude the possibility that a sequence of agents may read them.
- Is it secure?
60 Aggressive CWSP (ACWSP)
- The intuitive wall model implicitly requires that no sequence of agents can read X and Y:
- A0 reads X = X0 and X1,
- A1 reads X1 and X2,
- . . .
- An reads Xn = Y
61 Composite Information Flow
- A composite information flow (CIF) is a sequence of DIFs, denoted by →, such that
- X = X0 → X1 → . . . → Xn = Y
- and we write X CIF Y.
- NCIF = No CIF
62 Composite Information Flow
- Aggressive CWSP says that a system is secure if
- (X, Y) ∈ CIR ⇒ X NCIF Y
- (X, Y) ∉ CIR ⇒ X CIF Y
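A composite information flow is just reachability over direct flows, so ACWSP can be checked with a breadth-first search. The direct-flow pairs below are assumed toy data illustrating the Trojan-horse gap between SCWSP and ACWSP.

```python
from collections import deque

def cif(x, y, dif):
    """Is there a composite flow X = X0 -> X1 -> ... -> Xn = Y?
    dif is a set of ordered pairs of direct information flows."""
    seen, queue = {x}, deque([x])
    while queue:
        cur = queue.popleft()
        if cur == y:
            return True
        for a, b in dif:
            if a == cur and b not in seen:
                seen.add(b)
                queue.append(b)
    return False

def aggressive_cwsp_secure(cir, dif):
    """ACWSP: no composite flow, in either direction, between a conflict pair."""
    for pair in cir:
        x, y = tuple(pair)
        if cif(x, y, dif) or cif(y, x, dif):
            return False
    return True

# X and Y conflict; the flows X -> Z and Z -> Y are each SCWSP-legal,
# but they compose into X -> Y (assumed toy data).
cir = {frozenset(("X", "Y"))}
print(aggressive_cwsp_secure(cir, {("X", "Z"), ("Z", "Y")}))  # False
print(aggressive_cwsp_secure(cir, {("X", "Z")}))              # True
```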
63 The Problem
- Simple CWSP ⇏ Aggressive CWSP
- This is a malicious Trojan horse problem.
64 Needed ACWSP Theorem
- Theorem: If CIR is anti-reflexive, symmetric, and anti-transitive, then
- Simple CWSP ⇒ Aggressive CWSP
65 C- and CIR-classes
- CIR: anti-reflexive, symmetric, anti-transitive
- (Diagram: a CIR-class separating Cp-classes)
66 Application 2
- Association mining by Granular/Bitmap computing
67 Fundamental Theorem
- Theorem 1:
- All isomorphic relations have isomorphic patterns.
68 Illustration: Table K
v1 → (TWENTY, MAR, NY)
v2 → (TEN, MAR, SJ)
v3 → (TEN, FEB, NY)
v4 → (TEN, FEB, LA)
v5 → (TWENTY, MAR, SJ)
v6 → (TWENTY, MAR, SJ)
v7 → (TWENTY, APR, SJ)
v8 → (THIRTY, JAN, LA)
v9 → (THIRTY, JAN, LA)
69 Illustration: Table K'
v1 → (20, 3rd, New York)
v2 → (10, 3rd, San Jose)
v3 → (10, 2nd, New York)
v4 → (10, 2nd, Los Angeles)
v5 → (20, 3rd, San Jose)
v6 → (20, 3rd, San Jose)
v7 → (20, 4th, San Jose)
v8 → (30, 1st, Los Angeles)
v9 → (30, 1st, Los Angeles)
70 Illustration: Patterns in K
v1 → (TWENTY, MAR, NY)
v2 → (TEN, MAR, SJ)
v3 → (TEN, FEB, NY)
v4 → (TEN, FEB, LA)
v5 → (TWENTY, MAR, SJ)
v6 → (TWENTY, MAR, SJ)
v7 → (TWENTY, APR, SJ)
v8 → (THIRTY, JAN, LA)
v9 → (THIRTY, JAN, LA)
71 Isomorphic 2-Associations

K               Count   K'
(TWENTY, MAR)   3       (20, 3rd)
(MAR, SJ)       3       (3rd, San Jose)
(TWENTY, SJ)    3       (20, San Jose)
72Canonical Model
- Bitmaps in Granular Forms
- Patterns in Granular Forms
73 Table K
v1 → (20, 3rd)
v2 → (10, 3rd)
v3 → (10, 2nd)
v4 → (10, 2nd)
v5 → (20, 3rd)
v6 → (20, 3rd)
v7 → (20, 4th)
v8 → (30, 1st)
v9 → (30, 1st)
74 Illustration: K → GDM

K                 Granules
v1 → (20, 3rd)    {v1, v5, v6, v7}   {v1, v2, v5, v6}
v2 → (10, 3rd)    {v2, v3, v4}       {v1, v2, v5, v6}
v3 → (10, 2nd)    {v2, v3, v4}       {v3, v4}
v4 → (10, 2nd)    {v2, v3, v4}       {v3, v4}
v5 → (20, 3rd)    {v1, v5, v6, v7}   {v1, v2, v5, v6}
v6 → (20, 3rd)    {v1, v5, v6, v7}   {v1, v2, v5, v6}
v7 → (20, 4th)    {v1, v5, v6, v7}   {v7}
v8 → (30, 1st)    {v8, v9}           {v8, v9}
v9 → (30, 1st)    {v8, v9}           {v8, v9}
75 Illustration: K → GDM (bitmaps)

K                 GDM
v1 → (20, 3rd)    (100011100)  (110011000)
v2 → (10, 3rd)    (011100000)  (110011000)
v3 → (10, 2nd)    (011100000)  (001100000)
v4 → (10, 2nd)    (011100000)  (001100000)
v5 → (20, 3rd)    (100011100)  (110011000)
v6 → (20, 3rd)    (100011100)  (110011000)
v7 → (20, 4th)    (100011100)  (000000100)
v8 → (30, 1st)    (000000011)  (000000011)
v9 → (30, 1st)    (000000011)  (000000011)
76 Granular Data Model (of K)

NAME   Elementary Granule
10     (011100000) = {v2, v3, v4}
20     (100011100) = {v1, v5, v6, v7}
30     (000000011) = {v8, v9}
1st    (000000011) = {v8, v9}
2nd    (001100000) = {v3, v4}
3rd    (110011000) = {v1, v2, v5, v6}
4th    (000000100) = {v7}
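The granular/bitmap computation of association support is a bitwise AND followed by a popcount. The bitmaps below are the deck's elementary granules (bit i = tuple v(i+1)); note that "4th" is written here as the bitmap of {v7}.

```python
# Elementary granules of the Granular Data Model as 9-bit maps.
granules = {
    "10":  0b011100000, "20": 0b100011100, "30": 0b000000011,
    "1st": 0b000000011, "2nd": 0b001100000,
    "3rd": 0b110011000, "4th": 0b000000100,
}

def support(*names):
    """Support of an itemset = popcount of the bitwise AND of its granules."""
    bits = 0b111111111
    for n in names:
        bits &= granules[n]
    return bin(bits).count("1")

print(support("20", "3rd"))  # |{v1,v5,v6,v7} AND {v1,v2,v5,v6}| = 3
print(support("10", "2nd"))  # 2
```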
77 Associations in Granular Forms

K           Granule intersection                             Cardinality
(20, 3rd)   {v1,v5,v6,v7} ∩ {v1,v2,v5,v6} = {v1,v5,v6}      3
(10, 2nd)   {v2,v3,v4} ∩ {v3,v4} = {v3,v4}                  2
(30, 1st)   {v8,v9} ∩ {v8,v9} = {v8,v9}                     2
78 Associations in Granular Forms

K           Granule intersection                             Cardinality
(20, 3rd)   {v1,v5,v6,v7} ∩ {v1,v2,v5,v6} = {v1,v5,v6}      3
(3rd, SJ)   {v1,v2,v5,v6} ∩ {v2,v5,v6,v7} = {v2,v5,v6}      3
(20, SJ)    {v1,v5,v6,v7} ∩ {v2,v5,v6,v7} = {v5,v6,v7}      3
79 Fundamental Theorems
- 1. All isomorphic relations are isomorphic to the canonical model (GDM).
- 2. A granule of the GDM is a high-frequency pattern if it has high support.
80 Relation Lattice Theorems
- 1. The granules of the GDM generate a lattice of granules with join ∪ and meet ∩.
- This lattice is called the Relational Lattice by Tony Lee (1983).
- 2. All elements of the lattice can be written as joins of primes (join-irreducible elements).
- (Birkhoff & MacLane, 1977, Chapter 11)
81 Finding Associations by Linear Inequalities
- Theorem. Let P1, P2, ... be the primes (join-irreducible elements) in the canonical model. Then
- G = x1 P1 ∪ x2 P2 ∪ ...
- is a high-frequency pattern if
- |G| = x1 |P1| + x2 |P2| + ... ≥ th
- (each xj is a binary number)
82 Join-irreducible Elements
10 ∩ 1st: {v2,v3,v4} ∩ {v8,v9} = ∅
20 ∩ 1st: {v1,v5,v6,v7} ∩ {v8,v9} = ∅
30 ∩ 1st: {v8,v9} ∩ {v8,v9} = {v8,v9}
10 ∩ 2nd: {v2,v3,v4} ∩ {v3,v4} = {v3,v4}
20 ∩ 2nd: {v1,v5,v6,v7} ∩ {v3,v4} = ∅
30 ∩ 2nd: {v8,v9} ∩ {v3,v4} = ∅
10 ∩ 3rd: {v2,v3,v4} ∩ {v1,v2,v5,v6} = {v2}
20 ∩ 3rd: {v1,v5,v6,v7} ∩ {v1,v2,v5,v6} = {v1,v5,v6}
30 ∩ 3rd: {v8,v9} ∩ {v1,v2,v5,v6} = ∅
10 ∩ 4th: {v2,v3,v4} ∩ {v7} = ∅
20 ∩ 4th: {v1,v5,v6,v7} ∩ {v7} = {v7}
30 ∩ 4th: {v8,v9} ∩ {v7} = ∅
83 AM by Linear Inequalities
- x1: {v1,v5,v6} = (20, 3rd)
- x2: {v2} = (10, 3rd)
- x3: {v3,v4} = (10, 2nd)
- x4: {v7} = (20, 4th)
- x5: {v8,v9} = (30, 1st)
- Cardinalities: |x1| = 3, |x2| = 1, |x3| = 2, |x4| = 1, |x5| = 2
84 AM by Linear Inequalities
- G = x1{v1,v5,v6} ∪ x2{v2} ∪ x3{v3,v4} ∪ x4{v7} ∪ x5{v8,v9}
- 3x1 + x2 + 2x3 + x4 + 2x5 ≥ 3
- 1. x1 = 1
- 2. x2 = 1, x3 = 1; or x2 = 1, x5 = 1
- 3. x3 = 1, x4 = 1; or x3 = 1, x5 = 1
- 4. x4 = 1, x5 = 1
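The case analysis above can be reproduced by enumerating 0/1 assignments over the primes and keeping the minimal ones that meet the threshold. The cardinalities come from the deck; the threshold th = 3 is inferred from the listed solutions.

```python
from itertools import product

# Primes (join-irreducible granules) and their cardinalities, from the deck:
# x1={v1,v5,v6}, x2={v2}, x3={v3,v4}, x4={v7}, x5={v8,v9}
primes = {"x1": 3, "x2": 1, "x3": 2, "x4": 1, "x5": 2}
th = 3  # support threshold assumed from the slide's solutions

names = list(primes)
solutions = []
# Lexicographic 0/1 enumeration visits every subset before its supersets,
# so keeping only sets with no earlier subset yields the minimal solutions.
for bits in product((0, 1), repeat=len(names)):
    if sum(b * primes[n] for b, n in zip(bits, names)) >= th:
        chosen = {n for b, n in zip(bits, names) if b}
        if not any(s < chosen for s in solutions):
            solutions.append(chosen)
print(sorted(sorted(s) for s in solutions))  # the six minimal patterns above
```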
85 AM by Linear Inequalities
- G = x1{v1,v5,v6} ∪ x2{v2} ∪ x3{v3,v4} ∪ x4{v7} ∪ x5{v8,v9}
- 3x1 + x2 + 2x3 + x4 + 2x5 ≥ 3
- 1. x1 = 1:
- G = 1·{v1,v5,v6}, |G| = 1·3 = 3
- (20, 3rd) = {v1,v5,v6,v7} ∩ {v1,v2,v5,v6} = {v1,v5,v6}, support 3
86 AM by Linear Inequalities
- G = x1{v1,v5,v6} ∪ x2{v2} ∪ x3{v3,v4} ∪ x4{v7} ∪ x5{v8,v9}
- 3x1 + x2 + 2x3 + x4 + 2x5 ≥ 3
- x2 = 1, x3 = 1; or x2 = 1, x5 = 1:
- x2{v2} ∪ x3{v3,v4} → (10, 3rd) ∨ (10, 2nd)
- x2{v2} ∪ x5{v8,v9} → (10, 3rd) ∨ (30, 1st)
- x3 = 1, x4 = 1; or x3 = 1, x5 = 1
- x4 = 1, x5 = 1
87 AM by Linear Inequalities
- G = x1{v1,v5,v6} ∪ x2{v2} ∪ x3{v3,v4} ∪ x4{v7} ∪ x5{v8,v9}
- 3x1 + x2 + 2x3 + x4 + 2x5 ≥ 3
- x3 = 1, x4 = 1; or x3 = 1, x5 = 1:
- x3{v3,v4} ∪ x4{v7} → (10, 2nd) ∨ (20, 4th)
- x3{v3,v4} ∪ x5{v8,v9} → (10, 2nd) ∨ (30, 1st)
- x4 = 1, x5 = 1
88 AM by Linear Inequalities
- G = x1{v1,v5,v6} ∪ x2{v2} ∪ x3{v3,v4} ∪ x4{v7} ∪ x5{v8,v9}
- 3x1 + x2 + 2x3 + x4 + 2x5 ≥ 3
- x4 = 1, x5 = 1:
- x4{v7} ∪ x5{v8,v9} → (20, 4th) ∨ (30, 1st)