1
Data Mining For Hypertext: A Tutorial Survey
11/11/01
SDBI Winter 2001
  • Based on a paper by
  • Soumen Chakrabarti
  • Indian Institute of Technology Bombay
  • soumen@cse.iitb.ernet.in
  • Lecture by
  • Noga Kashti
  • Efrat Daum

2
Let's start with definitions
  • Hypertext - a collection of documents (or
    "nodes") containing cross-references or "links"
    which, with the aid of an interactive browser
    program, allow the reader to move easily from one
    document to another.
  • Data Mining - Analysis of data in a database
    using tools which look for trends or anomalies
    without knowledge of the meaning of the data.

3
Two Ways For Getting Information From The Web
  • Clicking On Hyperlinks
  • Searching Via Keyword Queries

4
Some History
  • Before the Web became popular, hypertext was already studied by the ACM, SIGIR, SIGLINK/SIGWEB, and digital libraries communities.
  • Classical IR (information retrieval) deals with documents, whereas the Web deals with semi-structured data.

5
Some Numbers ..
  • The Web exceeds 800 million HTML pages on about
    three million servers.
  • Almost a million pages are added daily.
  • A typical page changes in a few months.
  • Several hundred gigabytes change every month.

6
Difficulties With Accessing Information On The
Web
  • The usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe.
  • Semi-structured data.
  • Sheer size and flux.
  • No consistent standard or style.

7
The Old Search Process Is Often Unsatisfactory!
  • Deficiency of scale.
  • Poor accuracy (low recall and low precision).

8
Better Solutions: Data Mining and Machine Learning
  • NL (natural language) techniques.
  • Statistical techniques for learning structure in various forms from text, hypertext, and semi-structured data.

9
Issues We'll Discuss
  • Models
  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Social network analysis

10
Models For Text
  • Representation for text with statistical analyses
    only (bag-of-words)
  • The vector space model
  • The binary model
  • The multinomial model

11
Models For Text (cont.)
  • The vector space model
  • Documents → tokens → canonical forms.
  • Each canonical token is an axis in a Euclidean space.
  • The t-th coordinate of d is n(d,t), where
  • t is a term
  • d is a document

12
The Vector Space Model: Normalize the Document Length to 1 (formula below)
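A standard way to write the length normalization, using n(d,t) as defined on the previous slide (this is a reconstruction, not the slide's own image):

```latex
% L2 (unit-length) normalization of document d's term vector
\hat{d}_t = \frac{n(d,t)}{\sqrt{\sum_{\tau \in T} n(d,\tau)^2}},
\qquad \|\hat{d}\|_2 = 1
```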

13
More Models For Text
  • The binary model: a document is a set of terms, i.e. a subset of the lexicon. Word counts are not significant.
  • The multinomial model: a die with |T| faces; each face (term) has a probability θt of showing up when tossed. Having decided the total word count, the author tosses the die repeatedly and writes down the term that shows up each time.

14
Models For Hypertext
  • Hypertext = text with hyperlinks.
  • Varying levels of detail.
  • Example: a directed graph (D, L)
  • D = the set of nodes (documents/pages)
  • L = the set of links (hyperlinks)

15
Models For Semi-structured Data
  • A point of convergence for the Web (documents) and database (data) communities.

16
Models For Semi-structured Data(cont.)
  • E.g. topic directories with tree-structured hierarchies.
  • Examples: Open Directory Project, Yahoo!
  • Another representation: XML.

17
Supervised Learning (classification)
  • Initialization: training data, in which each item is marked with a label or class from a discrete, finite set.
  • Input: unlabeled data.
  • The algorithm's role: guess the labels of the unlabeled data.

18
Supervised Learning (cont.)
  • Example: topic directories.
  • Advantages: help structure and restrict keyword search; can enable powerful searches.

19
Probabilistic Models For Text Learning
  • Let c1,…,cm be m classes (topics), each with a set of training documents Dc.
  • Prior probability of a class: Pr(c) = |Dc| / Σc′ |Dc′|.
  • T = the universe of terms in all the training documents.

20
Probabilistic Models For Text Learning (cont.)
  • Naive Bayes classification
  • Assumption: for each class c, there is a binary text generator model.
  • Model parameters: Φc,t, the probability that a document in class c will mention term t at least once (likelihood formula below).
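Under this binary assumption, the class-conditional document likelihood takes the standard Bernoulli naive Bayes form (a reconstruction of the slide's formula):

```latex
\Pr(d \mid c) = \prod_{t \in d} \Phi_{c,t} \prod_{t \in T,\, t \notin d} \bigl(1 - \Phi_{c,t}\bigr)
```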

21
Naive Bayes classification (cont.)
  • Problems:
  • Short documents are discouraged (many (1 − Φc,t) factors for absent terms).
  • The Pr(d|c) estimate is likely to be greatly distorted.

22
Naive Bayes classification (cont.)
  • With the multinomial model (formula below):
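The standard multinomial likelihood being referred to, with θc,t the face probabilities and ℓ(d) = Σt n(d,t) the document length (a reconstruction, not the slide's own image):

```latex
\Pr(d \mid c) = \frac{\ell(d)!}{\prod_{t} n(d,t)!} \prod_{t} \theta_{c,t}^{\,n(d,t)},
\qquad \sum_{t \in T} \theta_{c,t} = 1
```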

23
Naive Bayes classification (cont.)
  • Problems:
  • Again, short documents are discouraged.
  • Inter-term correlation is ignored.
  • Multiplicative Φc,t "surprise" factor.
  • Conclusion:
  • Both models are effective.

24
More Probabilistic Models For Text Learning
  • Parameter smoothing and feature
    selection.
  • Limited dependence modeling.
  • The maximum entropy technique.
  • Support vector machines (SVMs).
  • Hierarchies over class labels.

25
Learning Relations
  • An extension of classification: a combination of statistical and relational learning.
  • Improves accuracy.
  • Has the ability to invent predicates.
  • Can represent hyperlink graph structure and word statistics of neighbor documents.
  • Learned rules are not dependent on specific keywords.

26
Unsupervised learning
  • Input: hypertext documents, without labels.
  • Goal: a hierarchy among the documents.
  • What is a good clustering?

27
Basic clustering techniques
  • Techniques for clustering:
  • k-means
  • hierarchical agglomerative clustering

28
Basic clustering techniques
  • Documents are represented in an unweighted vector space or in a TFIDF vector space.
  • Similarity between two documents (see the sketch below):
  • cos(θ), where θ is the angle between their corresponding vectors, or
  • the distance between the (length-normalized) vectors.
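A minimal sketch of the two representations and the cosine measure, using plain Python dictionaries (the tokenization and the toy corpus are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def tf_vector(text):
    """Unweighted (raw term-frequency) vector of a document."""
    return Counter(text.lower().split())

def tfidf_vectors(docs):
    """TFIDF-weighted vectors: tf(d,t) * log(n / df(t))."""
    n = len(docs)
    tfs = [tf_vector(d) for d in docs]
    df = Counter(t for tf in tfs for t in tf)          # document frequency of each term
    return [{t: tf[t] * math.log(n / df[t]) for t in tf} for tf in tfs]

def cosine(u, v):
    """cos(theta) between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = ["web mining of hypertext", "hypertext and web search", "clustering of documents"]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```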

29
k-means clustering
  • The k-means algorithm
  • Input:
  • d1,…,dn, a set of n documents
  • k, the number of clusters desired (k ≤ n)
  • Output:
  • C1,…,Ck, k clusters containing the n documents

30
k-means clustering
  • The k-means algorithm (cont.); a minimal sketch follows.
  • Initial guess: k initial means m1,…,mk
  • Until there is no change in any mean:
  • For each document d: d is in ci if ‖d − mi‖ is the minimum of all k distances.
  • For 1 ≤ i ≤ k: replace mi with the mean of all the documents assigned to ci.
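A minimal sketch of this loop over dense numpy vectors (the random initialization and the convergence test on the means are obvious choices assumed here, not taken from the slide):

```python
import numpy as np

def kmeans(docs, k, rng=np.random.default_rng(0)):
    """docs: (n, dim) array of document vectors; returns (means, labels)."""
    # initial guess: k documents chosen at random as the initial means
    means = docs[rng.choice(len(docs), size=k, replace=False)].copy()
    while True:
        # assign each document d to the cluster whose mean is nearest
        dists = np.linalg.norm(docs[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # replace each mean with the mean of the documents assigned to it
        new_means = np.array([docs[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):      # no change in any mean -> stop
            return new_means, labels
        means = new_means

docs = np.array([[1., 0.], [0.9, 0.1], [0., 1.], [0.1, 0.9]])
means, labels = kmeans(docs, k=2)
print(labels)
```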

31
k-means clustering
  • The k-means algorithm: example (the original slide shows example clusterings for k = 2 and k = 3).
32
k-means clustering (cont.)
  • Problem:
  • High dimensionality.
  • E.g., if each of 30,000 dimensions has only two possible values, the vector space has 2^30000 points.
  • Solution:
  • Project out some dimensions.

33
Agglomerative clustering
  • Documents are merged into super-documents (groups) until only one group is left.
  • Some definitions (one standard instantiation is given below):
  • s(d1, d2), the similarity between documents d1 and d2
  • s(A), the self-similarity of group A
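The slide's definitions were images; one standard choice consistent with the surrounding slides (cosine similarity between documents, average pairwise similarity within a group) is:

```latex
s(d_1, d_2) = \frac{\langle d_1, d_2\rangle}{\|d_1\|\,\|d_2\|},
\qquad
s(A) = \frac{1}{|A|(|A|-1)} \sum_{\substack{d_1, d_2 \in A \\ d_1 \neq d_2}} s(d_1, d_2)
```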

34
Agglomerative clustering
  • The agglomerative clustering algorithm
  • Input:
  • d1,…,dn, a set of n documents
  • Output:
  • G, the final group, with a nested hierarchy

35
Agglomerative clustering (cont.)
  • The agglomerative clustering algorithm (a sketch follows)
  • Initially G = {G1,…,Gn}, where Gi = {di}
  • While |G| > 1:
  • Find A and B in G such that s(A ∪ B) is maximized
  • G ← (G \ {A, B}) ∪ {A ∪ B}
  • Time: O(n²)
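A deliberately simple sketch of this loop (not the O(n²) version); it uses the average-pairwise-cosine self-similarity shown above as s(A ∪ B), which is an assumed instantiation rather than the slide's exact definition:

```python
import itertools
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def self_similarity(group, docs):
    """Average pairwise similarity s(A) over distinct documents in the group."""
    pairs = list(itertools.combinations(group, 2))
    return sum(cos(docs[i], docs[j]) for i, j in pairs) / len(pairs)

def agglomerate(docs):
    """docs: (n, dim) array. Returns the merge history (a nested hierarchy)."""
    G = [(i,) for i in range(len(docs))]        # initially Gi = {di}
    history = []
    while len(G) > 1:
        # find A, B in G such that s(A U B) is maximized
        A, B = max(itertools.combinations(G, 2),
                   key=lambda ab: self_similarity(ab[0] + ab[1], docs))
        G = [g for g in G if g not in (A, B)] + [A + B]
        history.append((A, B))
    return history

docs = np.array([[1., 0.], [0.9, 0.2], [0., 1.], [0.2, 0.9]])
print(agglomerate(docs))
```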

36
Agglomerative clustering (cont.)
  • The agglomerative clustering algorithm: example (figure in the original slide).

37
Techniques from linear algebra
  • Documents and terms are represented by vectors
    in Euclidean space.
  • Applications of linear algebra to text analysis
  • Latent semantic indexing (LSI)
  • Random projections

38
Co-occurring terms
  • Example (figure in the original slide).

39
Latent semantic indexing (LSI)
  • Vector space model of documents:
  • Let m = |T|, the lexicon size.
  • Let n = the number of documents.
  • Define A, the m × n term-by-document matrix,
  • where aij = the number of occurrences of term i in document j.

40
Latent semantic indexing (LSI)
  • How can this matrix be reduced?

41
Singular Value Decomposition (SVD)
  • Let A ∈ R^(m×n), m ≥ n, be a matrix.
  • The singular value decomposition of A is the factorization A = U D V^T, where
  • U and V are orthogonal, U^T U = V^T V = In
  • D = diag(σ1,…,σn) with σi ≥ 0 for 1 ≤ i ≤ n
  • Then,
  • U = [u1,…,un]; u1,…,un are the left singular vectors
  • V = [v1,…,vn]; v1,…,vn are the right singular vectors
  • σ1,…,σn are the singular values of A.

42
Singular Value Decomposition (SVD)
  • A A^T = (U D V^T)(V D^T U^T) = U D I D U^T = U D^2 U^T
  • ⇒ A A^T U = U D^2 = [σ1^2 u1, …, σn^2 un]
  • For 1 ≤ i ≤ n, A A^T ui = σi^2 ui
  • ⇒ the columns of U are the eigenvectors of A A^T.
  • Similarly, A^T A = V D^2 V^T
  • ⇒ the columns of V are the eigenvectors of A^T A.
  • The eigenvalues of A A^T (or A^T A) are σ1^2,…,σn^2.

43
Singular Value Decomposition (SVD)
  • Let Ak = Σi=1..k σi ui vi^T be the k-truncated SVD.
  • rank(Ak) = k
  • ‖A − Ak‖2 ≤ ‖A − Mk‖2 for any matrix Mk of rank k.

44
Singular Value Decomposition (SVD)
  • Note: A, Ak ∈ R^(m×n)

45
LSI with SVD
  • Define q ∈ R^m, a query vector.
  • qi ≠ 0 if term i is part of the query.
  • Then A^T q ∈ R^n is the answer vector.
  • (A^T q)j ≠ 0 if document j contains one or more terms of the query.
  • How can we do better?

46
LSI with SVD
  • Use Ak instead of A
  • ⇒ calculate Ak^T q
  • Now a query on "car" can also return a document containing the word "auto" (see the sketch below).
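A minimal numpy sketch of the whole pipeline: build the term-by-document matrix A, take the k-truncated SVD, and answer a query with Ak^T q instead of A^T q (the tiny corpus and k = 2 are illustrative assumptions):

```python
import numpy as np

docs = ["car engine repair", "auto engine maintenance", "hiking trail map"]
terms = sorted({t for d in docs for t in d.split()})
m, n = len(terms), len(docs)

# term-by-document matrix: A[i, j] = occurrences of term i in document j
A = np.zeros((m, n))
for j, d in enumerate(docs):
    for t in d.split():
        A[terms.index(t), j] += 1

# k-truncated SVD: Ak = sum over i <= k of sigma_i * u_i * v_i^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# query vector q: q_i != 0 iff term i is part of the query
q = np.zeros(m)
q[terms.index("car")] = 1.0

print("A^T q  :", A.T @ q)    # only the literal 'car' document matches
print("Ak^T q :", Ak.T @ q)   # the 'auto' document now also gets a nonzero score
```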

47
Random projections
  • Theorem:
  • Let v be a unit vector in R^n,
  • H a randomly oriented l-dimensional subspace through the origin, and
  • X the random variable giving the square of the length of the projection of v onto H.
  • Then E[X] = l/n, and for a suitable choice of l, X is sharply concentrated around l/n: the probability that X deviates from l/n by more than a small fraction is exponentially small in l.

48
Random projections
  • Project a set of points onto a randomly oriented subspace.
  • Inter-point distances suffer only small distortion.
  • The technique (see the sketch below):
  • reduces the dimensionality of the points, and
  • speeds up distance computations.
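A minimal sketch of the technique: project points onto a randomly oriented l-dimensional subspace (here via a Gaussian matrix with orthonormalized columns, one standard construction assumed here) and compare inter-point distances before and after:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, dim, l = 50, 1000, 100

points = rng.normal(size=(n_points, dim))

# randomly oriented l-dimensional subspace: orthonormal basis from a Gaussian matrix
basis, _ = np.linalg.qr(rng.normal(size=(dim, l)))      # columns span the subspace
projected = points @ basis * np.sqrt(dim / l)           # rescale to preserve lengths

def pairwise(x):
    """Matrix of Euclidean distances between all pairs of rows of x."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

orig, proj = pairwise(points), pairwise(projected)
mask = ~np.eye(n_points, dtype=bool)
ratios = proj[mask] / orig[mask]                        # distortion of each distance
print("distance ratios: min %.3f  max %.3f" % (ratios.min(), ratios.max()))
```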

49
Semi-supervised learning
  • Real-life applications typically have a few labeled documents and many unlabeled documents.
  • This setting lies between supervised and unsupervised learning.

50
Learning from labeled and unlabeled documents
  • Expectation Maximization (EM) algorithm (sketched below)
  • Initially: train a naive Bayes classifier using only the labeled data.
  • Repeat EM iterations until near convergence:
  • E-step: assign class probabilities Pr(c|d) to all unlabeled documents using the current θc,t estimates.
  • M-step: re-estimate θc,t from all documents, weighted by those class probabilities.
  • Error is reduced by a third in the best cases.
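A compact sketch of this loop for a multinomial naive Bayes classifier over word-count vectors; the toy data, the Laplace smoothing, and the fixed number of iterations are illustrative assumptions:

```python
import numpy as np

def m_step(X, resp, alpha=1.0):
    """Re-estimate priors and term probabilities theta_{c,t} from
    (soft) class assignments resp[d, c]; alpha is Laplace smoothing."""
    priors = resp.sum(axis=0) / resp.sum()
    counts = resp.T @ X + alpha                        # (classes, terms)
    theta = counts / counts.sum(axis=1, keepdims=True)
    return priors, theta

def e_step(X, priors, theta):
    """Assign class probabilities Pr(c|d) to every document."""
    log_post = np.log(priors) + X @ np.log(theta).T    # (docs, classes)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# toy word-count matrices: 2 labeled documents, 3 unlabeled, 2 classes
X_lab = np.array([[3, 0, 1], [0, 3, 1]], dtype=float)
y_lab = np.array([0, 1])
X_unl = np.array([[2, 0, 2], [0, 2, 2], [1, 0, 3]], dtype=float)

# initially: train naive Bayes from the labeled data only
resp_lab = np.eye(2)[y_lab]                            # hard labels as one-hot rows
priors, theta = m_step(X_lab, resp_lab)

X_all = np.vstack([X_lab, X_unl])
for _ in range(20):                                     # EM iterations
    resp_unl = e_step(X_unl, priors, theta)             # E-step on unlabeled docs
    resp_all = np.vstack([resp_lab, resp_unl])           # labeled docs keep their labels
    priors, theta = m_step(X_all, resp_all)              # M-step on all documents

print(e_step(X_unl, priors, theta).round(2))
```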

51
Relaxation labeling
  • The hypertext model
  • documents are nodes in a hypertext graph.
  • There are other sources of information induced by
    the links.

52
Relaxation labeling
  • c = class, t = terms, N = neighbors
  • In supervised learning: Pr(t | c)
  • In hypertext, using the neighbors' terms: Pr(t(d), t(N(d)) | c)
  • A better model, using the neighbors' classes: Pr(t(d), c(N(d)) | c)
  • Circularity!

53
Relaxation labeling
  • Resolving the circularity:
  • Initially assign Pr^(0)(c|d) to each document d ∈ N(d1), where d1 is a test document (using text only).
  • Then iterate, updating each document's class probabilities from its neighbors' current estimates.

54
Social network analysis
  • Social networks
  • between academics by coauthoring, advising.
  • between movie personnel by directing and acting.
  • between people by making phone calls
  • between web pages by hyperlinking to other web
    pages.
  • Applications:
  • Google (PageRank)
  • HITS

55
Google (PageRank)
  • Popularity score p(v) of a page v (sketched below), where
  • → means "links to"
  • N = total number of nodes in the Web graph
  • Simulates a random walk on the Web graph.
  • Uses a score of "popularity".
  • The popularity score is precomputed, independent of the query.
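A minimal power-iteration sketch of this popularity score, using the standard PageRank recurrence p(v) = ε/N + (1 − ε) Σ_{u→v} p(u)/OutDegree(u) (the tiny graph, ε = 0.15, and the iteration count are illustrative assumptions):

```python
import numpy as np

# tiny web graph: links[u] = pages that u links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = len(links)
eps = 0.15

p = np.full(N, 1.0 / N)                 # popularity scores, query-independent
for _ in range(50):
    new_p = np.full(N, eps / N)
    for u, outs in links.items():
        for v in outs:                   # u -> v contributes p(u)/OutDegree(u)
            new_p[v] += (1 - eps) * p[u] / len(outs)
    p = new_p

print(p.round(3))                        # precomputed once, independent of any query
```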

56
Hyperlink induced topic search (HITS)
  • Depends on a search engine (it starts from the results of a keyword query).
  • For each node u in the graph, calculate an authority score (au) and a hub score (hu).
  • Initialize hu = au = 1.
  • Repeat until convergence (see the sketch below):
  • au = Σ_{v→u} hv and hu = Σ_{u→v} av
  • a and h are normalized to length 1 after each iteration.
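A minimal sketch of the HITS iteration on a tiny link graph (the graph and the iteration count are illustrative assumptions):

```python
import numpy as np

# adjacency matrix E: E[u, v] = 1 iff u links to v
E = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(E.shape[0])                  # authority scores a_u
h = np.ones(E.shape[0])                  # hub scores h_u
for _ in range(50):                      # repeat until (near) convergence
    a = E.T @ h                          # a_u = sum of h_v over all v -> u
    h = E @ a                            # h_u = sum of a_v over all u -> v
    a /= np.linalg.norm(a)               # normalize to length 1
    h /= np.linalg.norm(h)

print("authorities:", a.round(3))
print("hubs:       ", h.round(3))
```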

57
  • Interesting pages include links to other interesting pages.
  • The goal:
  • many relevant pages,
  • few irrelevant pages,
  • fast.

58
Conclusion
  • Supervised learning
  • Probabilistic models
  • Unsupervised learning
  • Techniques for clustering
  • k-means (top-down)
  • agglomerative (bottom-up)
  • Techniques for reducing dimensionality
  • LSI with SVD
  • Random projections
  • Semi-supervised learning
  • The EM algorithm
  • Relaxation labeling

59
References
  • http://www.engr.sjsu.edu/knapp/HCIRDFSC/C/k_means.htm
  • http://ei.cs.vt.edu/cs5604/cs5604cnCL/CL-illus.html
  • http://www.cs.utexas.edu/users/inderjit/Datamining
  • Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey)