Automatic indexing - PowerPoint PPT Presentation

About This Presentation
Title:

Automatic indexing

Description:

When the assignment of content identifiers is carried out with the aid of modern ... Used to conflate, or reduce morphological variants to a single term ... – PowerPoint PPT presentation

Slides: 18
Provided by: janegre2

Transcript and Presenter's Notes

Title: Automatic indexing


1
Automatic indexing
  • Salton
  • When the assignment of content identifiers is
    carried out with the aid of modern computing
    equipment the operation becomes automatic
    indexing

2
Pros and Cons of Automatic Indexing
  • Pros
  • (Lots of stuff out there to index)
  • Consistency
  • Cost reduction
  • Time reduction
  • Cons / limitations
  • Human intellect
  • Term relationships
  • Misleading in retrieval
  • Good algorithms, but generally domain-specific

3
Approaches and Methods
  • Initial approach
  • Create an inverted file
  • Inverted file contains all the index terms
    automatically drawn from the document records
    according to the indexing technique used
  • On-the-fly (facilitate retrieval)
  • natural language processing (NLP)
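
A minimal sketch of the inverted file idea on this slide, assuming the simplest possible indexing technique (lowercased whitespace tokens). The function name `build_inverted_file` and the sample records are illustrative, not taken from the presentation.

```python
from collections import defaultdict

def build_inverted_file(records):
    """Map each index term to the sorted list of record ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in records.items():
        # Assumed indexing technique: every lowercased token becomes an index term.
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

# Hypothetical document records keyed by record id.
records = {
    1: "automatic indexing of documents",
    2: "manual indexing by human indexers",
    3: "automatic classification of documents",
}
inverted_file = build_inverted_file(records)
print(inverted_file["indexing"])  # [1, 2]
```

Looking up a term then returns the matching record ids directly, without scanning the records themselves.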

4
  • Recall
  • No. of relevant docs retrieved /
    Total no. of relevant docs in DB
  • Measures retrieval of all the relevant documents from the
    whole database
  • Precision
  • No. of relevant docs retrieved /
    Total no. of docs retrieved
  • Proportion of retrieved documents that are relevant
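
The two ratios on this slide can be computed from sets of document ids. This is a small sketch; the id sets below are made-up values for illustration only.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved / all relevant documents in the database."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """Relevant documents retrieved / all documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

# Hypothetical search result: 4 documents retrieved, 5 relevant in the database.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
print(recall(retrieved, relevant))     # 0.4
print(precision(retrieved, relevant))  # 0.5
```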

5
Methods
  • Stemming
  • Used to conflate, or reduce morphological
    variants to a single term
  • Should increase recall by grouping similar
    terms
  • Achieve through stemming algorithms
  • Porter algorithm, simple S stemmer
  • Weakness: can be too severe
  • E.g., "morals" and "more" both stemmed to "more"
  • Most effective in general domains with plurals and
    -ing, -ed endings
  • Utility in retrieval, truncation
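
As a rough illustration of conflating plural variants, here is a sketch in the spirit of the simple S stemmer. The three rules follow a commonly cited description of that stemmer and should be read as an assumption, not as the exact algorithm the slide refers to; the function name `s_stem` is illustrative.

```python
def s_stem(word):
    """Strip common English plural endings; rules are tried in order."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

print(s_stem("libraries"))  # library
print(s_stem("documents"))  # document
print(s_stem("class"))      # class
```

Indexing the stems instead of the surface forms lets a query for "document" also match "documents", which is how stemming is expected to raise recall.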

6
Automatic Classification/Indexing Methods
  • Statistical analysis
  • Based on the theory that term frequency determines
    aboutness (informetric properties)
  • Very high-frequency and very low-frequency terms are
    poor candidates
  • Simple to perform
  • Effective in creating index of single terms or
    broad class groupings
  • Phrase indexing more complicated, and difficult
    to render
  • E.g., "running water"; "highway traffic control
    in North Carolina"
  • Domain sensitive (better in some than others)
  • Weighting options
  • Within document: tf (term frequency)
  • Proportion of documents within the collection
    that have a term/s: IDF (inverse document
    frequency)
  • Combination of tf and IDF (tf-IDF) quite successful
  • Term discrimination
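
A short sketch of the tf and IDF weighting options on this slide. It assumes raw within-document counts for tf and log(N / df) for IDF, which is only one of several common variants; the function name `tf_idf` and the sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df: number of documents in the collection that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # within-document term frequency
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = [
    "indexing indexing retrieval",
    "retrieval systems",
    "retrieval of documents",
]
weights = tf_idf(docs)
print(weights[0]["retrieval"])  # 0.0, because the term occurs in every document
```

A term that appears in every document gets weight zero, while a term repeated in one document and absent elsewhere gets the highest weight, which matches the slide's point about poor and good candidate terms.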

7
Methods
  • Syntactical analysis
  • Phrase (or grammatical) indexing
  • Intended to address ambiguities with statistical
    methods
  • Subject, object, predicate
  • Rules complex, NLP, domain specific
  • Anaphoric statements are problematic, e.g., "He ate apples"
  • Research shows only a slight improvement over
    statistical methods
  • Probabilistic methods
  • Term is a good indicator if it appears in a
    relevant document
  • Basically the tf-IDF method, can include weighting
  • Clustering, bringing like documents together
    based on the above methods, integrate thesauri,
    or other tools
  • Classification/categorization, assignment of a
    top node (class notation or term)

8
Natural language vs. Controlled Vocabulary
  • Natural language continuum
  • <-basic key word------------IR------------full NLP->
  • Full discourse
  • History
  • Uniterm approach (Taube, 1953), optical
    coincidence cards, edge-notched punch cards
  • Cranfield studies

9
Uniterm

10
Peek-a-boo card

11
Natural language vs. Controlled Vocab.
  • Pros
  • Cons
  • Production cost
  • Cost to the end-user
  • Facilitate specificity in terms of access
  • Exhaustivity indexing
  • Handling of errors

12
What is Automatic Classification?
  • Automatic manipulation of a document's contents
    to support logical grouping with other similar
    documents for organization and/or retrieval
    activities. Can include the assignment of, or
    manipulation of, classification notation.

13
Clustering techniques
  • A technique by which relationships among data
    elements such as documents or document
    attributes are determined and closely related
    elements are grouped into clusters. (Korfhage,
    1997)
  • Group vectors using various coefficients
  • E.g., Dice Coefficient
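
A sketch of grouping documents with the Dice coefficient, assuming each document is represented as a set of index terms (a binary vector). The greedy single-pass grouping and the 0.5 threshold are illustrative assumptions, not a method named in the presentation.

```python
def dice(a, b):
    """Dice coefficient for two term sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def cluster(docs, threshold=0.5):
    """Greedy single pass: a document joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of doc ids; the first id is the seed
    for doc_id, terms in docs.items():
        for members in clusters:
            if dice(terms, docs[members[0]]) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([doc_id])
    return clusters

# Hypothetical documents, each reduced to a set of index terms.
docs = {
    "d1": {"automatic", "indexing", "terms"},
    "d2": {"automatic", "indexing", "weights"},
    "d3": {"punch", "cards", "history"},
}
clusters = cluster(docs)
print(clusters)  # [['d1', 'd2'], ['d3']]
```

Other similarity coefficients can be swapped in for `dice` without changing the grouping loop.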

14
Other methods
  • Term weighting
  • Various other IR approaches, automatic indexing

15
Why Automatic Classification?
  • Classification is time consuming and expensive
  • Knowledge structuring
  • Too much information
  • Status of automatic classification
  • Fairly experimental, although not completely
  • Operational systems for e-mail

16
Why Automatic Classification?
  • More expedient than a human
  • Less costly /??
  • More consistent than human
  • Easier to fix errors

17
Why Automatic Classification?
  • Your articles.
  • How defined
  • What was the purpose
  • How was the automatic classification done, or
    discussed
  • Outcome
  • Experimental / Operational setting?