CSE 450 - PowerPoint PPT Presentation

1
CSE 450 Web Mining Seminar
Professor Brian D. Davison
Fall 2005
  • A Project Presentation on
  • Identifying most descriptive terms
  • by
  • Osama Ahmed Khan
  • 12/16/2005

2
Problem
  • Finding the most descriptive terms for a
    particular document in a collection of documents
    (webpages)
  • Estimating the best description for a new
    location in a higher-dimensional space

3
Terminology
  • Term: adjective-noun bi-gram -- t_i
  • Document: content -- d_i

4
Algorithm
  1. Creates a 2-D matrix A (t x d), representing the
    frequency of each term t_i in each document d_i
  2. Creates a 3-D matrix B (d x t x t), representing
    the frequency of co-occurrence of each term t_i
    with every other term t_j in each document d_i
  3. Sorts the pairs (t_i, t_j) for each document d_i
    in descending order of frequency, where the pairs
    (t_i, t_j) serve as the descriptive terms for
    that document d_i
  4. Extracts the first n pairs in the sorted index
    for each document d_i, where n is specified by
    the user
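
Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's implementation: the slides do not say how co-occurrence is counted, so the sketch assumes two terms co-occur when they appear in the same sentence, and it stores the matrices sparsely as counters. The function names and the sample terms are hypothetical.

```python
from collections import Counter
from itertools import combinations

def term_counts(doc):
    """Step 1 (one column of A): frequency of each term in one document.
    `doc` is a list of sentences, each a list of adjective-noun terms."""
    return Counter(term for sentence in doc for term in sentence)

def pair_counts(doc):
    """Step 2 (one slice of B): co-occurrence frequency of each term pair.
    Assumption: a pair co-occurs once per sentence containing both terms."""
    counts = Counter()
    for sentence in doc:
        # sorted(set(...)) gives each unordered pair one canonical key
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts

def top_pairs(doc, n):
    """Steps 3 and 4: sort pairs by descending frequency, keep the first n.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(pair_counts(doc).items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:n]]

# Hypothetical document with three sentences
doc = [
    ["automatic extraction", "large collection", "new document"],
    ["automatic extraction", "large collection"],
    ["new document", "automatic extraction"],
]
best = top_pairs(doc, 2)
```

Here `best` holds the two most frequent pairs for the sample document. Keeping only observed pairs in a counter avoids allocating the full d x t x t array, most of which would be zero.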

5
Algorithm (contd.)
  1. A document is represented in a higher-dimensional
    space by its t(t-1)/2 coordinates, where each
    dimension is a (t_i, t_j) pair
  2. Any missing coordinate for a document d_i is
    assigned a value of zero
  3. A new document d_j located in t(t-1)/2-dimensional
    space is best described by using the Mahalanobis
    distance metric to find the minimum distance
    between d_j and the other (d-1) documents
  4. A new document d_j identified in
    t(t-1)/2-dimensional space without its
    coordinates being known is best described by
    using the k-Nearest Neighbors approach
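
The distance step can be illustrated as follows. This is a simplified sketch, not the method used in the project: it assumes a diagonal covariance estimate (each pair dimension is scaled by its own variance, with a small ridge term to avoid division by zero), which is a special case of the Mahalanobis distance rather than the full metric. The slides do not specify how neighbors are found when coordinates are unknown, so the sketch only ranks documents whose coordinates are available. All names and the sample counts are hypothetical.

```python
import math

def to_vector(pair_counts, dimensions):
    """Place a document in pair space; missing pairs get 0."""
    return [float(pair_counts.get(pair, 0)) for pair in dimensions]

def diagonal_variances(vectors, ridge=1e-6):
    """Per-dimension sample variance plus a small ridge (assumption)."""
    n = len(vectors)
    variances = []
    for k in range(len(vectors[0])):
        column = [v[k] for v in vectors]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / max(n - 1, 1)
        variances.append(var + ridge)
    return variances

def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance under a diagonal covariance matrix."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(x, y, variances)))

def k_nearest(new_vec, doc_vecs, variances, k):
    """Indices of the k documents closest to new_vec (ties by index)."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: (mahalanobis_diag(new_vec, doc_vecs[i], variances), i),
    )
    return ranked[:k]

# Hypothetical pair space over three terms a, b, c: t(t-1)/2 = 3 dimensions
dimensions = [("a", "b"), ("a", "c"), ("b", "c")]
known = [
    {("a", "b"): 2, ("a", "c"): 1},
    {("a", "b"): 2, ("b", "c"): 1},
    {("b", "c"): 4},
]
doc_vecs = [to_vector(d, dimensions) for d in known]
variances = diagonal_variances(doc_vecs)
new_vec = to_vector({("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1}, dimensions)
neighbors = k_nearest(new_vec, doc_vecs, variances, k=2)
```

With k = 1 this reduces to the minimum-distance lookup of step 3. A full covariance matrix would capture correlations between pair dimensions, but with t(t-1)/2 dimensions and few documents it is typically singular, which is why the sketch falls back to the diagonal form.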

6
Dataset
  • Xiaoguang Qi provided pre-processed data
  • http://wume.cse.lehigh.edu/xiq204/topics/

7
Implementation
  • Code
    • Text Mining Infrastructure (TMI)
    • http://hddi.cse.lehigh.edu
    • C
  • Metrics
    • Precision
    • Recall
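
For reference, the two metrics can be computed as below. This is a minimal sketch that assumes the extracted descriptive terms are compared against a hand-labeled set of relevant terms; the slides do not describe the evaluation setup, and the function name and sample terms are hypothetical.

```python
def precision_recall(extracted, relevant):
    """Precision = hits / extracted terms; recall = hits / relevant terms."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 extracted terms, 5 relevant terms, 2 in common
extracted = ["web mining", "search engine", "new page", "old link"]
relevant = ["web mining", "search engine", "link analysis",
            "topic detection", "keyword extraction"]
scores = precision_recall(extracted, relevant)
```

Raising n (the number of pairs kept per document) generally trades precision for recall, since more relevant terms are captured along with more irrelevant ones.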

8
Applications
  1. Topic Detection through search engines
  2. Finding document representation in different
    domains

9
Open Problems
  • Finding an approximate transformation (if one
    exists) from the t-dimensional term space to a
    k-dimensional space, where k = t(t-1)/2, when the
    set of documents D is also represented in the
    k-dimensional space
  • Estimating the best description of a document in
    either of the two spaces when its coordinates in
    one of the spaces are missing

10
References
  • Improved Automatic Keyword Extraction Given More
    Linguistic Knowledge, Annette Hulth, Proceedings
    of the 2003 Conference on Empirical Methods in
    Natural Language Processing
  • Using Web Structure for Classifying and
    Describing Web Pages, E.J. Glover,
    K. Tsioutsiouliklis, S. Lawrence, D.M. Pennock,
    G.W. Flake, WWW2002, Hawaii, USA
  • Lexically-Generated Subject Hierarchies for
    Browsing Large Collections, C.G. Nevill-Manning,
    I.H. Witten, G.W. Paynter

11
  • Thank You