Digital Libraries - PowerPoint PPT Presentation

Provided by: hpl5

Transcript and Presenter's Notes


1
Digital Libraries
  • Carl Staelin

2
Motivation
  • Enormous amounts of information
  • 1GB email
  • 100MB compressed papers
  • Hundreds of bookmark/favorite web pages
  • Internet and intranet (1.4G pages and counting!)
  • ???
  • BUT information is useless unless you can find
    it!

3
Disk space trends
  • To a first approximation, the amount of data
    equals the amount of available disk space
  • Disk capacity is growing exponentially
  • Cost ($/megabyte) is decreasing exponentially
  • Amount of data is growing exponentially
  • Utilize cheap disk space for indexing!

4
How to search for information?
  • Depends on what I know, or want to know
  • "I remember reading a paper which presented an
    algorithm for clustering documents; I think it
    was by Sahami."
  • How soon might we have a cure for diabetes?
  • When did I last send Paul email?

5
Other applications
  • Message routing
  • Send spam email to Deleted Items folder
  • Route customer queries to support engineers
  • Message filtering
  • Automatically send me copies of articles on
    training Support Vector Machines
  • Document clustering
  • Dynamic Yahoo-like topic hierarchies

6
Overview
  • Terminology
  • Current solutions
  • Information retrieval technology
  • Inverted indexes
  • Machine learning

7
Terminology
  • Corpus: a set of documents
  • Index: a mapping of terms or topics to documents
  • Dictionary: list of terms appearing in the corpus
  • Relevance
  • Precision: fraction of retrieved documents that
    were relevant
  • Recall: fraction of relevant documents that were
    retrieved

8
Libraries
  • Library: a warehouse of information which has
    been indexed and cataloged to make retrieval of
    information easier in the future.
  • Traditional paper-based catalog/indexes
  • Author and title catalogs
  • Library of Congress topic index
  • Citation indexes

9
Modern Libraries
  • Traditional paper-based catalogs/indexes
  • Catalog/index databases
  • Traditional catalogs/indexes
  • Allow more extensive/sophisticated queries
  • Do not contain content, only bibliographic data

10
Search Engines
  • Information Retrieval technology
  • Index all web content
  • Keyword search over all content
  • Relevance ranking of results
  • Meta-search Engines
  • Query several search engines
  • Collate/merge query results

11
Other
  • Yahoo
  • Content-based catalog for internet
  • Human-generated catalog
  • Human-specified Categories
  • Librarians assign pages to categories
  • Lexis / Nexis
  • Full text of newswire articles, legal rulings
  • Indexed/cataloged content

12
Status Summary
  • Lots of different information sources are
    available
  • Balkanized information space
  • Disjoint sources
  • Disjoint indexes
  • Disjoint search methods
  • Some interesting technology

13
Information Retrieval
  • Text indexing and retrieval
  • Inverted indexes
  • Vector space model
  • Relevance ranking
  • Classification
  • Clustering

14
Inverted Indexes
  • Used to rapidly locate documents containing a
    particular word
  • Database of (nearly) all words appearing in any
    document
  • Each word has a list of all documents containing
    it
  • May contain frequency of appearance in document
  • May contain location(s) in document
  • Common words are often ignored
  • Words such as "a", "and", and "the" are known as
    stop words

15
Example Inverted Index
  • Document A: "The quick brown fox"
  • Document B: "crazy like a fox"
  • Inverted Index
  • Brown: A
  • Crazy: B
  • Fox: A, B
  • Quick: A
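
The example index above can be reproduced with a few lines of Python. This is only a sketch: the stop-word list is an assumption, and "like" is included in it solely so that the result matches the slide.

```python
from collections import defaultdict

# Assumed stop-word list; "like" is added only to match the slide's index.
STOP_WORDS = {"a", "and", "the", "like"}


def build_inverted_index(documents):
    """Map each non-stop word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in sorted(index.items())}


documents = {"A": "The quick brown fox", "B": "crazy like a fox"}
inverted_index = build_inverted_index(documents)
for word, doc_ids in inverted_index.items():
    print(word, ", ".join(doc_ids))
# brown A
# crazy B
# fox A, B
# quick A
```

A fuller index might also store the frequency and position of each word per document, as the previous slide notes.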

16
Dictionaries
  • List of all words (or root words) appearing in
    any document
  • Assigns a unique ID to each word (or root word)
  • Example dictionary
  • Brown: 1
  • Crazy: 2
  • Fox: 3
  • Quick: 4

17
Dictionaries (cont'd)
  • What is a word?
  • Root word (word stemming)
  • Computer, computes, computing, ...
  • Synonyms (thesauri)
  • Bewildered, confused, disoriented, ...
  • Polysemes (multiple meanings)
  • Last: after all others (last in line), greatest
    (last degree), least important (last prize)
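
To make word stemming concrete, here is a deliberately crude suffix stripper. The suffix list and the minimum stem length are assumptions chosen for this example; it is not the Porter stemmer or any other standard algorithm, and it will mis-stem many real words.

```python
# Assumed suffix list, checked in order; chosen only for this illustration.
SUFFIXES = ("ing", "es", "er", "s")
MIN_STEM_LENGTH = 3  # assumed: avoid stripping very short words


def naive_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["Computer", "computes", "computing", "fox"]}
print(stems)
```

All three "comput..." variants map to the same root, so they can share one dictionary entry.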

18
Vector-space Model
  • Each document is represented by a vector
  • Each vector has as many elements as the size of
    the dictionary
  • Each element in the vector contains the frequency
    of appearance for that word in that document
  • Most elements are usually zero
  • Positional information is lost
  • Example vectors
  • Document A: (1, 0, 1, 1)
  • Document B: (0, 1, 1, 0)
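
The example vectors can be derived from the two example documents as sketched below. The stop-word list is an assumption (it includes "like" only so that the dictionary has the same four words as the slides), and the helper names are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "and", "the", "like"}  # assumed list, chosen to match the slides


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_dictionary(documents):
    """Assign IDs 1..N to the distinct words, in alphabetical order."""
    words = sorted({w for text in documents.values() for w in tokenize(text)})
    return {word: word_id for word_id, word in enumerate(words, start=1)}


def to_vector(text, dictionary):
    """Element (ID - 1) holds how often that dictionary word occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(dictionary)
    for word, count in counts.items():
        if word in dictionary:
            vector[dictionary[word] - 1] = count
    return vector


documents = {"A": "The quick brown fox", "B": "crazy like a fox"}
dictionary = build_dictionary(documents)
vectors = {doc_id: to_vector(text, dictionary) for doc_id, text in documents.items()}
print(dictionary)  # {'brown': 1, 'crazy': 2, 'fox': 3, 'quick': 4}
print(vectors)     # {'A': [1, 0, 1, 1], 'B': [0, 1, 1, 0]}
```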

19
Vector-space Model
  • Document similarity may be measured!
  • Euclidean distance
  • Cosine of angle between vectors
  • Enables nearly all aspects of information
    retrieval
  • Relevance ranking
  • Clustering
  • Vectors are generally HUGE and SPARSE for any
    reasonable corpus
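
Both similarity measures are easy to compute on sparse vectors. The sketch below stores only the non-zero elements, as a mapping from word ID to count; that storage choice is an assumption for illustration, not something the slides prescribe.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {word_id: count}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(count * v.get(word_id, 0) for word_id, count in u.items())


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    denominator = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denominator if denominator else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    word_ids = set(u) | set(v)
    return math.sqrt(sum((u.get(i, 0) - v.get(i, 0)) ** 2 for i in word_ids))


# Documents A and B from the earlier example, non-zero elements only.
doc_a = {1: 1, 3: 1, 4: 1}  # brown, fox, quick
doc_b = {2: 1, 3: 1}        # crazy, fox

print(round(cosine_similarity(doc_a, doc_b), 3))   # 0.408
print(round(euclidean_distance(doc_a, doc_b), 3))  # 1.732
```

Note that a higher cosine means more similar, while a higher Euclidean distance means less similar.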

20
Relevance Ranking
  • Methods for ranking the relative relevance of
    documents which satisfy query
  • Most methods measure distance between query and
    document
  • Some methods incorporate external information
  • Relative importance of word within document
    (e.g., it is part of the title or is a keyword)
  • Relative importance of the document (e.g., it is
    linked to by more web pages)
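
One simple way to rank by distance is to treat the query as a very short document and sort the documents by cosine similarity to it. The sketch below does only that; it assumes raw word counts as weights and does not model the external signals (title words, inbound links) listed above. The third document is made up for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Word-count vector of a text (no stop-word removal in this sketch)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two Counter vectors; 0.0 if either is empty."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return IDs of documents sharing a term with the query, best match first."""
    query_vector = term_vector(query)
    scored = [(cosine(query_vector, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


documents = {
    "A": "the quick brown fox",
    "B": "crazy like a fox",
    "C": "lazy dog",  # made-up document that matches nothing in the query
}
ranking = rank("quick fox", documents)
print(ranking)  # ['A', 'B']
```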

21
Machine Learning
  • Basic tasks
  • Regression
  • Classification
  • Clustering
  • General approaches
  • Supervised learning
  • Knowledge discovery

22
Features
  • Quantified aspects of problem
  • Words are often features
  • Problem-specific features
  • Is this email from a domain I know?
  • How many web pages link to this page?
  • Success of solution depends on quality of
    features!

23
Regression
  • Given examples (x, y), predict y = f(x)
  • Sample algorithms
  • Linear Regression
  • Neural Networks
  • Support Vector Machines

24
Classification
  • What class does this item belong to?
  • Email filtering: is this junk mail?
  • Sample algorithms
  • Support Vector Machines
  • Neural Networks
  • Bayesian Networks
  • K-Nearest Neighbor
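
K-Nearest Neighbor is the easiest of these to sketch: classify an item by majority vote among the k closest training examples. The two features used here (number of exclamation marks, and whether the sender's domain is known) and all training data are invented for illustration. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training, item, k=3):
    """Return the most common label among the k training points nearest to item."""
    by_distance = sorted(training, key=lambda example: math.dist(example[0], item))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training set: (exclamation marks, sender domain known? 1/0) -> label
training = [
    ((5.0, 0.0), "junk"),
    ((4.0, 0.0), "junk"),
    ((6.0, 1.0), "junk"),
    ((0.0, 1.0), "ok"),
    ((1.0, 1.0), "ok"),
    ((0.0, 0.0), "ok"),
]

print(knn_classify(training, (4.5, 0.0)))  # junk
print(knn_classify(training, (0.5, 1.0)))  # ok
```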

25
Clustering
  • Given a bag of items, create N bags of
    related items
  • Hierarchy generation: automatically determine
    sub-topics
  • Sample algorithms
  • Hierarchical Agglomerative Clustering
  • Iterative k-Means Clustering
  • Support Vector Clustering
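
Iterative k-means alternates between assigning each item to its nearest centroid and moving each centroid to the mean of its items. The sketch below uses the first k points as initial centroids, which is an assumption made to keep the example deterministic; real implementations usually pick starting points more carefully. The data is made up.

```python
import math


def k_means(points, k, max_iterations=20):
    """Cluster points into k groups; returns (clusters, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive initialisation (assumed)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return clusters, centroids


# Made-up 2-D items forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0), (1.0, 0.5), (8.5, 9.5)]
clusters, centroids = k_means(points, k=2)
print(clusters)
```

With document vectors as the points, each resulting cluster is one "bag of related items".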

26
Supervised Learning
  • Input: fixed-length vector of reals
  • Output: fixed-length vector of reals
  • Training: learn correlation between input and
    known output values
  • Operation: given an input vector, predict the
    correct output vector

27
Accuracy Analysis
  • Need mechanisms to compare the effectiveness of
    various algorithms
  • Recall: the fraction of relevant documents that
    were retrieved
  • Precision: the fraction of retrieved documents
    that were relevant
  • Usually, increased performance in one metric
    results in decreased performance in the other
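
Both metrics follow directly from the set of retrieved documents and the set of relevant documents. A minimal sketch, with made-up document IDs and a convention (assumed here) of returning 0.0 when a denominator is empty:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that were relevant."""
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Made-up example: 4 documents retrieved, 3 actually relevant, 2 in common.
retrieved = {"A", "B", "C", "D"}
relevant = {"A", "C", "E"}

print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))
```

Retrieving every document in the corpus drives recall to 1.0 but usually lowers precision, which illustrates the trade-off noted above.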

28
Course Project
  • Develop an information retrieval / digital
    library system
  • Use standard components
  • Python
  • HTTP fetching, HTML parsing, mailbox, ...
  • Extensibility: MySQL, LAPACK interfaces
  • MySQL database for inverted index
  • LAPACK and ATLAS for array operations

29
Course Project (cont'd)
  • Basic framework exists
  • MySQL databases
  • Dictionary
  • Inverted Index
  • Scooter (web crawler)
  • Document Parser

30
Course Project (cont'd)
  • Aspects that you will develop include
  • Document relevance
  • Document clustering
  • Term creation (e.g., stemming)
  • Feature selection