WIRED Week 3

Provided by: bert189
1
WIRED Week 3
  • Readings Overview: IR History
  • Projects and/or Papers Topics
  • Mozilla Firefox

2
Developments in IR
  • Still similar goals
  • More automated, faster
  • A transition to more complex systems
  • Combinations of many early techniques
  • Link between IR and NLP (understanding language)
  • Only recently is Machine Learning (ML) gaining
    respectability
  • A focus on plain text, whole documents
  • Relationships between documents and words

3
Coordination Indexing
  • Pre-coordination: fixed descriptions, whole documents
  • More controlled vocabulary
  • Longer, detailed descriptions of documents, derived
    by hand
  • Post-coordination: variable combinations, user vocabulary
  • More specifics and document types
  • More flexible, less intervention into the system
  • Huge Implications for Indexing
  • More focus on natural language
  • Phrases, conditional use
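
A minimal sketch of post-coordination, assuming each document was indexed with single terms and the user combines them only at query time. The document IDs, the terms, and the post_coordinate helper are hypothetical and not taken from the readings.

```python
# Hypothetical index: each term maps to the set of documents tagged with it.
index = {
    "retrieval": {1, 2, 4},
    "indexing": {2, 3, 4},
    "thesaurus": {3, 4},
}


def post_coordinate(index, required, excluded=()):
    """Combine terms at query time: AND the required terms, then drop excluded ones."""
    required = list(required)
    if not required:
        return set()
    result = set(index.get(required[0], set()))
    for term in required[1:]:
        result &= index.get(term, set())
    for term in excluded:
        result -= index.get(term, set())
    return result


# Documents 2 and 4 carry both terms.
print(post_coordinate(index, ["retrieval", "indexing"]))
# Excluding "thesaurus" leaves only document 2.
print(post_coordinate(index, ["retrieval", "indexing"], ["thesaurus"]))
```

In a pre-coordinated system the combination would instead be fixed by the indexer when the document was described.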

4
Indexing Basics
  • Thesauri
  • Word frequency
  • Words
  • Stems
  • Phrases
  • Weighting schemes
  • Scalar
  • The beginning of ranking?
  • Applied, multivariate statistics
  • More available content
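
The steps from words to stems to weights can be sketched as below. This is an illustration under stated assumptions: crude_stem is a toy suffix stripper (not a real stemming algorithm), and the 1 + log(count) damping is one common weighting choice, not necessarily the scheme discussed in the readings.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper for illustration only; not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_weights(text):
    """Count stems, then damp raw counts with 1 + log(count)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_stem(w) for w in words)
    return {stem: 1 + math.log(n) for stem, n in counts.items()}


weights = term_weights("Indexing documents: an index of indexed document terms")
for stem, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{stem:10s} {weight:.3f}")
```

Note that the toy stemmer merges "indexing" with "index" but leaves "indexed" separate, which shows why the choice of stemming rules affects the resulting frequencies.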

5
IR Statistics
  • The engine that runs IR
  • Are used to coordinate systems
  • Weights
  • Frequency
  • (Recency)
  • Are used to evaluate systems
  • Precision
  • Recall
  • Relevancy
  • Ranking
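
As a small worked example of the two evaluation measures, precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document ID collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```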

6
What's more important?
  • Natural language queries or better indices?
  • Speed or accuracy?
  • Interface or document selection?
  • Easy update or structured data?
  • Find everything or eliminate redundancy?
  • Ideas or experiments?

7
Thesaurus Approach to IR
  • Vannevar Bush and the Memex
  • Classification
  • Grouping
  • Understanding the grouping
  • Completeness and consistency
  • Indexing
  • Terms
  • Concepts
  • Subject Matter
  • Synonyms (uniterms!)
  • Perfect vs. Relative
  • Only the words and immediate classifications
  • Applied or related words or classifications
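
One way to picture synonym control is a lookup that maps every variant to a single preferred term before indexing or searching. This is a sketch only: the vocabulary and the normalize_terms helper are hypothetical, and it treats all synonyms as "perfect" (fully interchangeable).

```python
# Hypothetical thesaurus: each entry maps a variant to one preferred term.
thesaurus = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "lorry": "truck",
    "truck": "truck",
}


def normalize_terms(terms, thesaurus):
    """Replace each term with its preferred form; unknown terms pass through unchanged."""
    return sorted({thesaurus.get(t.lower(), t.lower()) for t in terms})


print(normalize_terms(["Car", "auto", "lorry", "bicycle"], thesaurus))
# ['automobile', 'bicycle', 'truck']
```

Handling "relative" synonyms would need more than a flat lookup, for example a relation type or a weight on each entry.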

8
Managing thesauri
  • Where do the terms come from?
  • Document
  • Related documents
  • Indexers
  • Classification trees
  • Why limit a thesaurus?
  • Precision vs. Recall
  • System performance
  • Ranking
  • Users
  • "enquirer is asked to prepare an essay giving as
    many details" (p. 17)
  • Write a document to find a document?

9
Thesaurus Precision
  • How accurate are words?
  • Multiple meanings
  • Context
  • Culture
  • Associated words (phrases)
  • Term abstracts
  • Developed by experts
  • Understood by experts (only?)
  • Synonyms are OK if there is no difference?
  • Documents only represented in so many ways (holes
    in cards)

10
Structure of Information
  • So many documents with notational abstracts
  • Use more terms to draw distinctions
  • Frequency of occurrence
  • Match and compare document frequencies (vectors)
  • Relationships
  • Relational Algebra for mapping words
  • Databases of representations
  • Extendable to other documents (relational)
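
Comparing documents by their frequency vectors can be sketched as follows. Cosine similarity is used here as one standard way to compare such vectors; the slides do not say which measure the readings use, and the sample texts are invented.

```python
import math
from collections import Counter


def frequency_vector(text):
    """Represent a document as a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = frequency_vector("index terms describe documents")
doc_b = frequency_vector("index terms describe queries")
doc_c = frequency_vector("firefox browser search bar")

print(cosine(doc_a, doc_b))  # three of four words shared
print(cosine(doc_a, doc_c))  # no words shared
```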

11
Thesaurus Advantages
  • Descriptions of content
  • Not full abstracts
  • Extendable
  • Relevance scale(s)
  • Uniterms
  • Structural flexibility
  • Relational flexibility

12
Derivation from Machine-Readable (MR) Texts
  • A drastic change in classification
  • No longer personal work
  • Systematic
  • Not based on a single opinion
  • As always, coping with information growth
  • Issues of language
  • By discipline
  • By culture
  • By era
  • What else besides IR can benefit from MR?

13
Getting more specific
  • Not just word frequency
  • Phrases
  • Sentences
  • Citations
  • Identifying elements and their significance: a new
    kind of classification
  • Word lists by topic
  • Not just words and content, but analysis between
    documents

14
Keywords in Context
  • Machine readable context
  • Phrase or word with the document
  • Alphabetical order (in this example)
  • Single or multiple occurrences
  • Preceding words or phrases associated with
    Keyword in Context
  • Coefficients of Similarity between documents
  • Chart on p. 23
  • Most coefficients range from 1.00 to 0.00001
  • How documents relate to each other
  • A and B more so than C
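
A keyword-in-context listing can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a fixed window of neighboring words; the kwic helper and the sample sentence are hypothetical.

```python
def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    words = text.split()
    rows = []
    for i, word in enumerate(words):
        # Ignore case and trailing punctuation when matching the keyword.
        if word.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(words[max(0, i - width): i])
            right = " ".join(words[i + 1: i + 1 + width])
            rows.append((left, word, right))
    return rows


sample = "The index lists terms. Each index entry points to a document."
for left, word, right in kwic(sample, "index"):
    print(f"{left:>15} | {word} | {right}")
```

A full KWIC index would run this over every keyword and sort the rows alphabetically by keyword.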

15
Indexing and Abstracting by Association
  • Is representing knowledge the goal?
  • Just organization
  • Reference
  • Represent more than current classification
    systems
  • More subtle
  • More granular?
  • The network of information as the organizing
    concept
  • Faceted classification
  • Industry
  • Geography

16
What's the best scheme?
  • How many dimensions?
  • Building categories
  • Putting documents into categories
  • Working from text
  • Working from expert indexing
  • These in combination?
  • Co-occurrence
  • Letting the author define the context of the
    document
  • Classification
  • Abstracts and keywords

17
Measuring Associations
  • Co-occurring words
  • High frequency co-occurrences
  • Not just a match, but the degree of co-occurrence
  • Statistical measures
  • Normalization methods
  • Logarithmic
  • Factors
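
The idea of measuring the degree of co-occurrence, with logarithmic normalization, can be sketched as below. The score log(N * n_ab / (n_a * n_b)) over document counts is a pointwise-mutual-information style measure chosen here as an assumption; the slides name no specific formula, and the sample documents are invented.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(docs):
    """Score word pairs by log(N * n_ab / (n_a * n_b)) over document counts."""
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        pair: math.log(n_docs * n_ab / (word_df[pair[0]] * word_df[pair[1]]))
        for pair, n_ab in pair_df.items()
    }


docs = [
    "thesaurus terms",
    "thesaurus terms index",
    "index cards",
    "cards holes",
]
scores = cooccurrence_scores(docs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A score above zero means the pair appears together more often than independent occurrence would predict; a score of zero means the pair co-occurs about as often as chance.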

19
Break!
  • Working with Firefox
  • Basic interface modification
  • Making search easier
  • Automating search

20
Projects and/or Papers Overview
  • How can (Web) IR be better?
  • Better IR models
  • Better User Interfaces
  • More to find vs. easier to find
  • Scriptable applications
  • New interfaces for applications
  • New datasets for applications