1
Granulating the Semantics Space of Web Documents
  • Tsau Young (T. Y.) Lin
  • Computer Science Department, San Jose State
    University
  • San Jose, CA 95192-0249, USA
  • tylin@cs.sjsu.edu
  • and
  • I-Jen Chiang
  • Graduate Institute of Medical Informatics,
  • Taipei Medical University, Taipei, Taiwan 110
  • ijchiang@tmu.edu.tw

2
Main results
  • A set of documents is associated with a matrix,
    called the Latent Semantic Index. Then, by
    treating the row vectors as points in Euclidean
    space, the documents are clustered (categorized).
  • A set of documents is associated with a
    polyhedron; the association is believed to be
    one-to-one.
  • Corollary: A set of English documents and their
    Chinese translations can be identified via their
    semantics automatically.

3
Main results
  • A set of documents is associated with a
    polyhedron; the association is believed to be
    near one-to-one.
  • Corollary: A set of English documents and their
    Chinese translations can be identified via their
    semantics automatically.

4
Main results
  • This identification is by semantics, as there is
    no explicit correspondence between the two sets
    of documents.

5
Outline
  • 1. Introduction
  •    Domain: Information Ocean
  •    Methodology: Granular Computing
  •    Results
  • 2. Intuitive View of Granular Computing
  • 3. A Formal Theory
  • 4. Applications

6
Current State
  • Current search engines are syntax-based systems;
    they often return many meaningless web pages.
  • Cause: inadequate semantic analysis, and lack of
    a semantics-based organization of the information
    ocean.

7
Information Ocean
  • The Internet is an information ocean.
  • It needs a methodology to navigate.
  • A new methodology: Granular Computing.

8
Granular Computing-a methodology
  • The term "granular computing" was first used to
    label a subset of Zadeh's granular mathematics as
    my research area in BISC, 1996-97.
  • (Zadeh, L.A. (1998) Some reflections on soft
    computing, granular computing and their roles in
    the conception, design and utilization of
    information/intelligent systems, Soft Computing,
    2, 23-25.)

9
Granular computing
  • Since then, it has grown into an active research
    area: books, sessions, workshops.
  • (Zhong and Lin ran the first independent
    conference using the name GrC; there have been
    several sessions in JCIS.)
  • IEEE task force

10
Granular Computing
  • Granulation seems to be a natural problem-solving
    methodology deeply rooted in human thinking.
  • The human body has been granulated into head,
    neck, etc.

11
Granulating Information Ocean
  • In this talk, we will explain how we granulate
    the semantic space of the information ocean,
    which consists of millions of web pages.

12
Organizing Information Ocean
  • How do we organize the information ocean?
  • By considering the Semantics Space.

13
Latent Semantic Space
  • A set of documents/web pages carries certain
    human thoughts. We will call the totality of
    these thoughts the
  • Latent Semantic Space (LSS)
  • (recall the Latent Semantic Index (LSI)).

14
Classification clustering
  • In data mining,
  • classification means identifying an unseen object
    with one of the known classes of a partition;
  • clustering means grouping a set of objects into
    disjoint classes based on similarity, distance,
    etc. The key ingredient here is that the classes
    are not known a priori.

15
Categorizing Information
  • Multiple concepts can simultaneously exist in a
    single web page, so to organize web pages a
    powerful
  • clustering
  • method is needed.
  • (The concepts cannot be known a priori.)

16
Latent Semantic Space(LSS)
  • The simplest representations of LSS?
  • A Set of Keywords
  • LSI

17
Latent Semantic Index
       Key1    Key2    ...  KeyN
Doc1   TFIDF1  TFIDF2  ...  TFIDFn
Doc2   TFIDF   ...
...
DocM   TFIDF   ...
18
TFIDF
  • Definition 1. Let Tr denote a collection of
    documents. The significance of a term ti in a
    document dj in Tr is its TFIDF value, calculated
    by the function tfidf(ti, dj), which equals
    tf(ti, dj) * idf(ti, dj):
  • TFIDF(ti, dj) = tf(ti, dj) * log(|Tr| / |Tr(ti)|)

19
TFIDF
  • where |Tr(ti)| denotes the number of documents in
    Tr in which ti occurs at least once, and
  • tf(ti, dj) = 1 + log(N(ti, dj))  if N(ti, dj) > 0
  •            = 0                   otherwise
  • where N(ti, dj) denotes the frequency with which
    the term ti occurs in document dj, counting all
    its non-stop words. (A small sketch follows.)
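A minimal Python sketch of this weighting, assuming a toy corpus of tokenized documents (the corpus and all names below are illustrative, not from the slides):

import math

def tfidf(term, doc, corpus):
    """TFIDF(t, d) = tf(t, d) * log(|Tr| / |Tr(t)|), per the definition above."""
    n = doc.count(term)                         # N(t, d): occurrences of t in d
    tf = 1 + math.log(n) if n > 0 else 0.0      # tf(t, d) = 1 + log N(t, d), else 0
    df = sum(1 for d in corpus if term in d)    # |Tr(t)|: documents containing t
    return tf * math.log(len(corpus) / df) if df > 0 else 0.0

corpus = [["wall", "street", "bank"], ["white", "house"], ["wall", "door"]]  # toy Tr
print(tfidf("wall", corpus[0], corpus))   # ~0.405 = 1 * log(3/2)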

21
Latent Semantic Index
  • Treat each row as a point in Euclidean space.
    Clustering such a set of points is a common
    approach (using SVD); a sketch is given below.
  • Note that these points have very little to do
    with the semantics of the documents.
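A minimal sketch of that common approach, assuming scikit-learn is available; the 4x3 TFIDF matrix and the choice of 2 clusters are made up for illustration:

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Hypothetical LSI matrix: rows = documents, columns = keyword TFIDF weights.
lsi = np.array([[2.1, 0.0, 0.3],
                [1.9, 0.1, 0.0],
                [0.0, 1.7, 1.5],
                [0.2, 1.8, 1.6]])

points = TruncatedSVD(n_components=2).fit_transform(lsi)        # SVD reduction
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)   # e.g. [0 0 1 1]: grouping follows row-vector geometry, not semantics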

22
Topological Space of LSS
  • Euclidean space has many metrics but has only one
    topology.
  • We will use this one.

23
Keywords (0-Association)
  • 1. Given by experts
  • 2. A term with a high TFIDF value is a keyword
  • e.g., Wall, Door, . . ., Street, Ave

24
Keywords Pairs (1-Association)
  • 1-association
  • (Wall, Street) → a financial notion
  • that has nothing to do with the two vertices,
    Wall and Street.

25
Keywords Pairs (1-Association)
  • 1-association
  • (White, House) → a notion
  • that has nothing to do with the two vertices,
    White and House.

26
Keywords Pairs (1-Association)
  • 1-association
  • (Neural, Network) → a notion
  • that has nothing to do with the two vertices,
    Neural and Network.

27
Geometric Analogy: 1-Simplex
  • (open) 1-simplex
  • (v0, v1) → open segment
  • (Wall, Street) → a financial notion
  • End points (boundaries) are not included.

28
Keywords are abstract vertices
  • LSS of Documents/web pages
  • ? Simplicial Complex
  • A special Hypergraph
  • Polyhedron ? Simplicial Complex

29
r-Association
  • r-association
  • Similarly, an r-association represents some
    semantics generated by a set of r keywords;
    moreover, the semantics may have nothing to do
    with the individual keywords.
  • There are mathematical structures that reflect
    such properties (see next).

30
Topology: (Open) Simplex
  • 1-simplex = open segment (v0, v1)
  • 2-simplex = open triangle (v0, v1, v2)
  • 3-simplex = open tetrahedron (v0, v1, v2, v3)
  • Boundaries are not included.

31
Topology: (Open) Simplex
  • An (open) r-simplex is the generalization of the
    low-dimensional simplexes (segment, triangle,
    tetrahedron) to the high-dimensional analogue in
    r-space (Euclidean space of dimension r).
  • Theorem. An r-simplex uniquely determines its r+1
    linearly independent vertices, and vice versa.

32
Face
  • The convex hull of any m vertices of the
    r-simplex is called an m-face.
  • The 0-faces are the vertices, the 1-faces are the
    edges, 2-faces are triangles, and the single
    r-face is the whole r-simplex itself.

33
Edge: a line segment where two faces of a polyhedron
meet; also called a side.
34
n-Complex
  • A simplicial complex C is a finite set of
    simplices such that:
  • any face of a simplex from C is also in C;
  • the intersection of any two simplices from C is
    either empty or a face of both of them.
  • If the maximal dimension of the constituting
    simplices is n, then the complex is called an
    n-complex. (A sketch of these conditions follows.)
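A minimal sketch that checks the two conditions on a family of simplices given as vertex sets (the set representation and the (Wall, Street) example are illustrative choices):

from itertools import combinations

def is_complex(simplices):
    """Check the two simplicial-complex conditions for a family of vertex sets."""
    family = {frozenset(s) for s in simplices}
    # 1. Every proper face of every simplex is also in the family.
    faces_closed = all(frozenset(face) in family
                       for s in family
                       for r in range(1, len(s))
                       for face in combinations(s, r))
    # 2. Any two simplices intersect in a common face, or not at all.
    meets_ok = all(not (a & b) or (a & b) in family
                   for a in family for b in family if a != b)
    return faces_closed and meets_ok

# The 1-simplex (Wall, Street) together with its 0-faces forms a 1-complex.
print(is_complex([{"Wall", "Street"}, {"Wall"}, {"Street"}]))   # True
print(is_complex([{"Wall", "Street"}]))                         # False: faces missing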

35
Upper/Closure approximations
  • Let B(p), p ∈ V, be an elementary granule.
  • U(X) = ∪ { B(p) : B(p) ∩ X ≠ ∅ }   (Pawlak)
  • C(X) = { p : B(p) ∩ X ≠ ∅ }   (Lin-topology)

36
Upper/Closure approximations
  • Cl(X) = ∪_i C^i(X)   (Sierpinski-topology)
  • where C^i(X) = C(C(. . . C(X) . . .)), iterated i
    times.
  • After (transfinite) steps, Cl(X) is closed.
    (A sketch of these approximations follows.)
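A minimal sketch of U(X), C(X), and the iterated closure Cl(X) on a finite universe, with the granulation B given as a dict p -> B(p); on a finite set the iteration reaches a fixed point, so no transfinite steps are needed (the example B is made up):

def upper(B, X):
    """Pawlak: U(X) = union of the granules B(p) that meet X."""
    return set().union(*[g for g in B.values() if g & X]) if X else set()

def C(B, X):
    """Lin: C(X) = { p : B(p) meets X }."""
    return {p for p, g in B.items() if g & X}

def closure(B, X):
    """Cl(X) = C(X) | C(C(X)) | ... , iterated until nothing new is added."""
    acc, nxt = set(), C(B, X)
    while not nxt <= acc:
        acc |= nxt
        nxt = C(B, acc)
    return acc

B = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {4}}      # illustrative granulation on V
print(upper(B, {2}), C(B, {2}), closure(B, {2}))   # {1, 2, 3} {1, 2} {1, 2}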

37
New View
  • Divide (and Conquer)
  • Partition of a set → (generalize) →
  • partition of a B-space
  • (topological partition)

38
New View: B-space
  • The pair (V, B) is the universe, namely
  • an object is a pair (p, B(p)),
  • where B : V → 2^V, p ↦ B(p), is a granulation.

39
Derived Partitions
  • The inverse images of B form a partition (an
    equivalence relation):
  • C = { Cp : Cp = B^(-1)(Bp), p ∈ V }

40
Derived Partitions
  • Cp is called the center class of Bp
  • A member of Cp is called a center.

41
Derived Partitions
  • The center class Cp consists of all the points
    that have the same granule:
  • Cp = { q : Bq = Bp }
  • (A sketch of this derived partition follows.)
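A minimal sketch of the derived partition: points are grouped by their granule, so each returned class is a center class Cp (the toy granulation B: V -> 2^V mirrors the country example on the next slide):

from collections import defaultdict

def center_classes(B):
    """Cp = { q : B(q) = B(p) }: points with identical granules form one class."""
    classes = defaultdict(set)
    for p, granule in B.items():
        classes[frozenset(granule)].add(p)
    return list(classes.values())

# Toy granulation B: V -> 2^V (each country's neighborhood is its own bloc).
B = {"US": {"US", "UK"}, "UK": {"US", "UK"},
     "Iran": {"Iran", "Iraq"}, "Iraq": {"Iran", "Iraq"},
     "Russia": {"Russia", "Korea"}, "Korea": {"Russia", "Korea"}}
print(center_classes(B))   # three classes: {US, UK}, {Iran, Iraq}, {Russia, Korea}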

42
C-quotient set
  • The set of center classes {Cp} is a quotient set.
  • Example classes: {Iran, Iraq, . . .},
    {US, UK, . . .}, {Russia, Korea, . . .}
43
New Problem Solving Paradigm
  • (Divide and) Conquer
  • Quotient set →
  • topological quotient space

44
Neighborhood of center class
  • C (in the case B is not reflexive)
  • (Figure: a B-granule/neighborhood overlapping
    several C-classes)
46
Topological partition
  • B-granule/neighborhood
  • (Figure: a B-granule/neighborhood covering
    several Cp-classes)
51
Topological Table (2-column)
Column I       Column II   Binary relation for Column I
US → CX        West        CX, CY (∈ BX)
UK → CX        West        CX, CZ (∈ BX)
Iran → CY      M-east      CY, CX (∈ BY)
Iraq → CY      M-east      CY, CZ (∈ BY)
Russia → CZ    East        CZ, CX (∈ BZ)
Korea → CZ     East        CZ, CY (∈ BZ)
52
Future Direction
  • Topological Reduct
  • Topological Table processing

53
Application 1 CWSP
  • In the UK, a financial service company may be
    consulted by competing companies. Therefore it is
    vital to have a lawfully enforceable security
    policy.

54
Background
  • Brewer and Nash (BN) proposed the Chinese Wall
    Security Policy model (CWSP) in 1989 for this
    purpose.

55
Policy: Simple CWSP (SCWSP)
  • "Simple Security": BN asserted that
  • "people (agents) are only allowed access to
    information which is not held to conflict with
    any other information that they (agents) already
    possess."

56
A Little Formal
  • Simple CWSP (SCWSP):
  • no single agent can read data X and Y
  • that are in CONFLICT.

57
Formal SCWSP
  • SCWSP says that a system is secure if
  • (X, Y) ∈ CIR ⇒ X NDIF Y
  • CIR = Conflict of Interest binary Relation
  • NDIF = No Direct Information Flow

58
Formal Simple CWSP
  • SCWSP says that a system is secure if
  • (X, Y) ∈ CIR ⇒ X NDIF Y
  • (X, Y) ∉ CIR ⇒ X DIF Y
  • CIR = Conflict of Interest binary Relation
  • (A sketch of the SCWSP check follows.)
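A minimal sketch of checking the simple policy on a finite system, with CIR given as a set of ordered pairs and each agent's holdings as a set of datasets (all names are illustrative):

def scwsp_ok(cir, holdings):
    """SCWSP: no single agent holds data X and Y with (X, Y) in CIR."""
    return all((x, y) not in cir
               for data in holdings.values()
               for x in data for y in data)

cir = {("BankA", "BankB"), ("BankB", "BankA")}          # conflicting datasets
holdings = {"alice": {"BankA", "OilCo"}, "bob": {"BankB", "OilCo"}}
print(scwsp_ok(cir, holdings))                          # True: policy satisfied
print(scwsp_ok(cir, {"eve": {"BankA", "BankB"}}))       # False: direct conflict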

59
More Analysis
  • SCWSP requires that no single agent can read X
    and Y,
  • but it does not exclude the possibility that a
    sequence of agents may read them.
  • Is it secure?

60
Aggressive CWSP (ACWSP)
  • The intuitive Wall model implicitly requires that
    no sequence of agents can read X and Y:
  • A0 reads X = X0 and X1,
  • A1 reads X1 and X2,
  • . . .
  • An reads Xn = Y

61
Composite Information flow
  • A composite information flow (CIF) is
  • a sequence of DIFs, denoted by →,
  • such that
  • X = X0 → X1 → . . . → Xn = Y,
  • and we write X CIF Y.
  • NCIF = No CIF. (A reachability sketch follows.)
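A minimal sketch that decides X CIF Y as reachability over the direct flows, with DIF given as a set of ordered pairs; the three-step chain mirrors the A0, ..., An scenario above:

def cif(dif, x, y):
    """X CIF Y iff some chain X = X0 -> X1 -> ... -> Xn = Y of DIFs exists."""
    seen, frontier = set(), {x}
    while frontier:
        node = frontier.pop()
        if node == y:
            return True
        seen.add(node)
        frontier |= {b for a, b in dif if a == node and b not in seen}
    return False

dif = {("X", "X1"), ("X1", "X2"), ("X2", "Y")}   # each agent moves data one step
print(cif(dif, "X", "Y"))   # True: a sequence of agents composes a flow from X to Y
print(cif(dif, "Y", "X"))   # False: no composite flow in the other direction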

62
Composite Information Flow
  • Aggressive CWSP says that a system is secure if
  • (X, Y) ∈ CIR ⇒ X NCIF Y
  • (X, Y) ∉ CIR ⇒ X CIF Y

63
The Problem
  • Does Simple CWSP ⇒ Aggressive CWSP?
  • This is a malicious Trojan horse problem.

64
Need ACWSP Theorem
  • Theorem: If CIR is anti-reflexive, symmetric and
    anti-transitive, then
  • Simple CWSP ⇒ Aggressive CWSP.

65
C and CIR classes
  • CIR: anti-reflexive, symmetric, anti-transitive
  • (Figure: a CIR-class spanning several Cp-classes)
66
Application 2
  • Association mining by Granular/Bitmap computing

67
Fundamental Theorem
  • Theorem 1
  • All isomorphic relations have isomorphic
    patterns

68
Illustration: Table K
v1 → (TWENTY, MAR, NY)
v2 → (TEN, MAR, SJ)
v3 → (TEN, FEB, NY)
v4 → (TEN, FEB, LA)
v5 → (TWENTY, MAR, SJ)
v6 → (TWENTY, MAR, SJ)
v7 → (TWENTY, APR, SJ)
v8 → (THIRTY, JAN, LA)
v9 → (THIRTY, JAN, LA)
69
Illustration: Table K'
v1 → (20, 3rd, New York)
v2 → (10, 3rd, San Jose)
v3 → (10, 2nd, New York)
v4 → (10, 2nd, Los Angeles)
v5 → (20, 3rd, San Jose)
v6 → (20, 3rd, San Jose)
v7 → (20, 4th, San Jose)
v8 → (30, 1st, Los Angeles)
v9 → (30, 1st, Los Angeles)
70
Illustration: Patterns in K
v1 → (TWENTY, MAR, NY)
v2 → (TEN, MAR, SJ)
v3 → (TEN, FEB, NY)
v4 → (TEN, FEB, LA)
v5 → (TWENTY, MAR, SJ)
v6 → (TWENTY, MAR, SJ)
v7 → (TWENTY, APR, SJ)
v8 → (THIRTY, JAN, LA)
v9 → (THIRTY, JAN, LA)
71
Isomorphic 2-Associations
K               Count   K'
(TWENTY, MAR)   3       (20, 3rd)
(MAR, SJ)       3       (3rd, San Jose)
(TWENTY, SJ)    3       (20, San Jose)
72
Canonical Model
  • Bitmaps in Granular Forms
  • Patterns in Granular Forms

73
Table K
v1 ? 20 3rd
v2 ? 10 3rd
v3 ? 10 2nd
v4 ? 10 2nd
v5 ? 20 3rd
v6 ? 20 3rd
v7 ? 20 4th
v8 ? 30 1st
v9 ? 30 1st
74
Illustration K?GDM
K                GDM
v1 → (20, 3rd)   {v1, v5, v6, v7}   {v1, v2, v5, v6}
v2 → (10, 3rd)   {v2, v3, v4}       {v1, v2, v5, v6}
v3 → (10, 2nd)   {v2, v3, v4}       {v3, v4}
v4 → (10, 2nd)   {v2, v3, v4}       {v3, v4}
v5 → (20, 3rd)   {v1, v5, v6, v7}   {v1, v2, v5, v6}
v6 → (20, 3rd)   {v1, v5, v6, v7}   {v1, v2, v5, v6}
v7 → (20, 4th)   {v1, v5, v6, v7}   {v7}
v8 → (30, 1st)   {v8, v9}           {v8, v9}
v9 → (30, 1st)   {v8, v9}           {v8, v9}
75
Illustration K?GDM
K                GDM (bitmaps)
v1 → (20, 3rd)   (100011100)   (110011000)
v2 → (10, 3rd)   (011100000)   (110011000)
v3 → (10, 2nd)   (011100000)   (001100000)
v4 → (10, 2nd)   (011100000)   (001100000)
v5 → (20, 3rd)   (100011100)   (110011000)
v6 → (20, 3rd)   (100011100)   (110011000)
v7 → (20, 4th)   (100011100)   (000000100)
v8 → (30, 1st)   (000000011)   (000000011)
v9 → (30, 1st)   (000000011)   (000000011)
76

Granular Data Model (of K )
NAME   Elementary Granule
10     (011100000) = {v2, v3, v4}
20     (100011100) = {v1, v5, v6, v7}
30     (000000011) = {v8, v9}
1st    (000000011) = {v8, v9}
2nd    (001100000) = {v3, v4}
3rd    (110011000) = {v1, v2, v5, v6}
4th    (000000100) = {v7}
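A minimal sketch of building this granular data model from table K: each attribute value is mapped to the set of rows carrying it, and the bitmap is that set written over the fixed row order v1..v9 (the dict literal simply re-types the table shown earlier):

from collections import defaultdict

rows = {"v1": ("20", "3rd"), "v2": ("10", "3rd"), "v3": ("10", "2nd"),
        "v4": ("10", "2nd"), "v5": ("20", "3rd"), "v6": ("20", "3rd"),
        "v7": ("20", "4th"), "v8": ("30", "1st"), "v9": ("30", "1st")}

granules = defaultdict(set)            # NAME -> elementary granule (set of row ids)
for rid, values in rows.items():
    for value in values:
        granules[value].add(rid)

order = sorted(rows)                   # v1..v9 fixes the bit positions
for name in sorted(granules):
    bitmap = "".join("1" if rid in granules[name] else "0" for rid in order)
    print(name, bitmap, sorted(granules[name]))
# e.g. 20 100011100 ['v1', 'v5', 'v6', 'v7'] and 4th 000000100 ['v7']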
77
Associations in Granular Forms
K           Granules                                              Cardinality
(20, 3rd)   {v1, v5, v6, v7} ∩ {v1, v2, v5, v6} = {v1, v5, v6}    3
(10, 2nd)   {v2, v3, v4} ∩ {v3, v4} = {v3, v4}                    2
(30, 1st)   {v8, v9} ∩ {v8, v9} = {v8, v9}                        2
78
Associations in Granular Forms
K           Granules                                              Cardinality
(20, 3rd)   {v1, v5, v6, v7} ∩ {v1, v2, v5, v6} = {v1, v5, v6}    3
(3rd, SJ)   {v1, v2, v5, v6} ∩ {v2, v5, v6, v7} = {v2, v5, v6}    3
(20, SJ)    {v1, v5, v6, v7} ∩ {v2, v5, v6, v7} = {v5, v6, v7}    3
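A minimal sketch of reading support straight off the granules: the support of a 2-association is the cardinality of the intersection of its two elementary granules (the granule dictionary re-types the GDM table above):

granules = {"10": {"v2", "v3", "v4"}, "20": {"v1", "v5", "v6", "v7"},
            "30": {"v8", "v9"}, "1st": {"v8", "v9"}, "2nd": {"v3", "v4"},
            "3rd": {"v1", "v2", "v5", "v6"}, "4th": {"v7"}}

def support(a, b):
    """Support of the 2-association (a, b) = |granule(a) & granule(b)|."""
    return len(granules[a] & granules[b])

for pair in [("20", "3rd"), ("10", "2nd"), ("30", "1st")]:
    print(pair, sorted(granules[pair[0]] & granules[pair[1]]), support(*pair))
# ('20', '3rd') ['v1', 'v5', 'v6'] 3, and so on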
79
Fundamental Theorems
  • 1. All isomorphic relations are isomorphic to the
    canonical model (GDM)
  • 2. A granule of GDM is a high frequency pattern
    if it has high support.

80
Relation Lattice Theorems
  • 1. The granules of the GDM generate a lattice of
    granules with join ∪ and meet ∩.
  • This lattice is called the Relational Lattice by
    Tony Lee (1983).
  • 2. All elements of the lattice can be written as
    joins of primes (join-irreducible elements).
  • (Birkhoff and MacLane, 1977, Chapter 11)

81
Find Association by Linear Inequalities
  • Theorem. Let P1, P2, . . . be the primes
    (join-irreducible elements) in the canonical
    model. Then
  • G = x1 P1 ∪ x2 P2 ∪ . . .
  • is a high-frequency pattern if
  • |G| = x1 |P1| + x2 |P2| + . . . ≥ th,
  • where each xj is a binary number (0 or 1).
    (A sketch of this enumeration follows.)
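A minimal sketch of the theorem's inequality, using the join-irreducible granules of the toy table and an assumed threshold th = 3; since the primes are pairwise disjoint, |G| is just the sum of the chosen sizes, and only singletons and pairs are enumerated (mirroring cases 1-4 on the following slides):

from itertools import combinations

# Join-irreducible granules P_j and the 2-associations they name (toy table above).
primes = {"x1": ({"v1", "v5", "v6"}, "(20, 3rd)"),
          "x2": ({"v2"}, "(10, 3rd)"),
          "x3": ({"v3", "v4"}, "(10, 2nd)"),
          "x4": ({"v7"}, "(20, 4th)"),
          "x5": ({"v8", "v9"}, "(30, 1st)")}
th = 3   # assumed support threshold

for r in (1, 2):
    for combo in combinations(sorted(primes), r):
        size = sum(len(primes[n][0]) for n in combo)   # x1|P1| + x2|P2| + ...
        if size >= th:
            print(combo, [primes[n][1] for n in combo], size)
# prints x1 alone, then every pair meeting the threshold (x2+x3, x2+x5, x3+x4, ...)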

82

Join-irreducible elements
10 ∩ 1st : {v2, v3, v4} ∩ {v8, v9} = ∅
20 ∩ 1st : {v1, v5, v6, v7} ∩ {v8, v9} = ∅
30 ∩ 1st : {v8, v9} ∩ {v8, v9} = {v8, v9}
10 ∩ 2nd : {v2, v3, v4} ∩ {v3, v4} = {v3, v4}
20 ∩ 2nd : {v1, v5, v6, v7} ∩ {v3, v4} = ∅
30 ∩ 2nd : {v8, v9} ∩ {v3, v4} = ∅
10 ∩ 3rd : {v2, v3, v4} ∩ {v1, v2, v5, v6} = {v2}
20 ∩ 3rd : {v1, v5, v6, v7} ∩ {v1, v2, v5, v6} = {v1, v5, v6}
30 ∩ 3rd : {v8, v9} ∩ {v1, v2, v5, v6} = ∅
10 ∩ 4th : {v2, v3, v4} ∩ {v7} = ∅
20 ∩ 4th : {v1, v5, v6, v7} ∩ {v7} = {v7}
30 ∩ 4th : {v8, v9} ∩ {v7} = ∅
83
AM by Linear Inequalities
  • x1 = {v1, v5, v6} = (20, 3rd)
  • x2 = {v2} = (10, 3rd)
  • x3 = {v3, v4} = (10, 2nd)
  • x4 = {v7} = (20, 4th)
  • x5 = {v8, v9} = (30, 1st)
  • |x1| = 3, |x2| = 1, |x3| = 2, |x4| = 1, |x5| = 2

84
AM by Linear Inequalities
  • x1 = {v1, v5, v6}, x2 = {v2}, x3 = {v3, v4},
    x4 = {v7}, x5 = {v8, v9}
  • |x1| = 3, |x2| = 1, |x3| = 2, |x4| = 1, |x5| = 2
  • Combinations meeting the threshold (th = 3):
  • 1. x1 = 1
  • 2. x2 = 1, x3 = 1, or x2 = 1, x5 = 1
  • 3. x3 = 1, x4 = 1, or x3 = 1, x5 = 1
  • 4. x4 = 1, x5 = 1

85
AM by Linear Inequalities
  • x1v1v5v6x2v2x3v3v4x4v7x5v8v9
  • x13x21x32x41 x52
  • 1. x11
  • 1v1v5v6 133
  • (20, 3rd) v1 v5 v6 v7 ? v1 v2 v5 v6
  • v1 v5 v6 3

86
AM by Linear Inequalities
  • x1v1v5v6x2v2x3v3v4x4v7x5v8v9
  • x13x21x32x41 x52
  • x2 1, x3 1, or x2 1, x5 1
  • x2v2x3v3v4 (10?20, 3rd)
  • x2v2x5v8v9 (10, 2nd) ? (10, 3rd)
  • x3 1, x4 1 or x3 1, x5 1
  • x4 1, x5 1

87
AM by Linear Inequalities
  • x1 = {v1, v5, v6}, x2 = {v2}, x3 = {v3, v4},
    x4 = {v7}, x5 = {v8, v9}
  • |x1| = 3, |x2| = 1, |x3| = 2, |x4| = 1, |x5| = 2
  • 3. x3 = 1, x4 = 1, or x3 = 1, x5 = 1:
  • x3 ∪ x4 = {v3, v4} ∪ {v7} → (10, 2nd) ∨ (20, 4th)
  • x3 ∪ x5 = {v3, v4} ∪ {v8, v9} → (10, 2nd) ∨ (30, 1st)
  • 4. x4 = 1, x5 = 1

88
AM by Linear Inequalities
  • x1v1v5v6x2v2x3v3v4x4v7x5v8v9
  • x13x21x32x41 x52
  • x4 1, x5 1
  • x3v3v4x5v8v9 (20, 4st) ? (30, 1st)