
1
http://comet.lehman.cuny.edu/jung/presentation/presentation.html
  • Introduction to Modern Information Retrieval and
    Search Engines
  • And Some Research Issues
  • Professor Gwang Jung
  • Department of Mathematics
  • and Computer Science
  • Lehman College, CUNY
  • November 10, Fall 05

2
Outline
  • Introduction to Information Retrieval
  • Introduction to Search Engines (IR Systems for
    the Web)
  • Search Engine Example: Google
  • Brief Introduction to Semantic Web
  • Useful Tools for IR System Building and
    Resources for Advanced Research
  • Research Issues

3
  • Introduction to Information Retrieval

4
Information Age
5
IR in General
  • Information Retrieval in general deals with:
  • Retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
  • A user query may be:
  • Structured (e.g., a Boolean expression of keywords or terms)
  • Unstructured (e.g., terms, a sentence, a document)
  • In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
  • Efficiency with respect to: algorithms, query processing, data organization/structure
  • Effectiveness with respect to: retrieval results

6
IR Systems
7
Formal Definition of IR System
  • IRS = (T, D, Q, F, R)
  • T: set of index terms (terms)
  • D: set of documents in a document database
  • Q: set of user queries
  • F: D x Q → R (retrieval function)
  • R: real numbers (RSV = Retrieval Status Value)
  • Relevance judgment is given by users.
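A minimal sketch of this formal model in Java (the type names are illustrative, not from the slides): F maps each (document, query) pair to a real-valued RSV, and results are ranked by descending RSV.

    import java.util.Set;

    // Illustrative types: a document d in D and a query q in Q,
    // both represented over the index-term set T.
    record Document(String id, Set<String> terms) {}
    record Query(Set<String> terms) {}

    // F : D x Q -> R, assigning each pair a Retrieval Status Value (RSV).
    interface RetrievalFunction {
        double rsv(Document d, Query q);
    }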

8
IRS versus DBMS
9
IR Systems Focus on Retrieval Effectiveness
  • The effective retrieval of relevant information depends on:
  • User task (formulating an effective query for the information need)
  • Indexing
  • IR systems in general adopt index terms to represent documents and queries.
  • Indexing is the process of developing document representations by assigning index terms to documents (information items).
  • Retrieval model (often called IR model) and logical view of documents
  • The logical view of documents (their logical representation) depends on the IR model.

10
Indexing
  • The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).
  • Descriptors = index terms = terms
  • Descriptors also lead users to participate in formulating information requests.
  • Two types of index terms:
  • Objective: author name, publisher, date of publication
  • Subjective: keywords selected from the full text
  • Two types of indexing methods:
  • Manual: performed by human experts (for very effective IR systems); may use an ontology
  • Automatic: performed by computer hardware and software

11
Indexing Aims (1)
  • Recall: the proportion of relevant items (documents) retrieved.
  • R = # of relevant items retrieved / total # of relevant items in the db
  • Precision: the proportion of retrieved documents that are relevant.
  • P = # of relevant items retrieved / total # of items retrieved (both measures are computed in the sketch after this list)
  • Effectiveness of indexing is mainly controlled by term specificity:
  • Broader terms may retrieve both useful (relevant) and useless (non-relevant) info items for the user.
  • Narrower (specific) index terms favor precision at the expense of recall.
  • Index Language (the set of well-selected index terms T):
  • Pre-specified (controlled): easy maintenance, poor adaptability
  • Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from users' queries.
  • Synonymous terms can be added to T via a thesaurus, e-dictionary (e.g., WordNet), and/or knowledge base (e.g., an ontology).
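As a quick sketch, both measures can be computed directly from the retrieved and relevant document-id sets (illustrative Java, not from the slides):

    import java.util.*;

    public class Effectiveness {
        // Recall = |retrieved ∩ relevant| / |relevant|
        static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
            return relevant.isEmpty() ? 0
                : (double) intersect(retrieved, relevant).size() / relevant.size();
        }

        // Precision = |retrieved ∩ relevant| / |retrieved|
        static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
            return retrieved.isEmpty() ? 0
                : (double) intersect(retrieved, relevant).size() / retrieved.size();
        }

        static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<>(a);
            r.retainAll(b);   // the relevant items that were actually retrieved
            return r;
        }
    }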

12
Indexing Aims (2)
  • Recall and Precision values vary from 0 to 1.
  • Average users want to have high recall and high
    precision.
  • In practice, a compromise must be reached (middle
    point).

[Figure: recall (R) vs. precision (P) tradeoff curve; both axes range from 0 to 1.0]
13
Steps for Indexing
  • Objective attributes of a document are extracted (e.g., title, author, URL, structure).
  • Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and, etc.).
  • Case folding might be performed.
  • Stemming might be used.
  • Frequencies of non-function words are used to specify term importance.
  • Term-frequency weighting fulfils only one of the indexing aims, i.e., recall.
  • Terms that occur rarely in the document database may be used to distinguish the documents in which they occur from those in which they do not occur → could improve precision.
  • Document frequency: the number of documents in the collection in which a term tj ∈ T occurs. (A minimal sketch of these steps follows.)
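A minimal sketch of these steps in Java, assuming a toy stop-word list (stemming omitted for brevity):

    import java.util.*;

    public class SimpleIndexer {
        // Toy stop-word list; real systems use a much larger one.
        static final Set<String> STOP = Set.of("of", "then", "this", "and", "the", "a");

        // Term frequencies for one document: lowercase, split on
        // non-word characters, drop stop words, count occurrences.
        static Map<String, Integer> termFrequencies(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String tok : text.toLowerCase().split("\\W+")) {
                if (!tok.isEmpty() && !STOP.contains(tok))
                    tf.merge(tok, 1, Integer::sum);
            }
            return tf;
        }

        // Document frequency: in how many documents does each term occur?
        static Map<String, Integer> documentFrequencies(List<String> docs) {
            Map<String, Integer> df = new HashMap<>();
            for (String doc : docs)
                for (String term : termFrequencies(doc).keySet())
                    df.merge(term, 1, Integer::sum);
            return df;
        }
    }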

14
Inverted Index File
[Diagram: inverted index entries mapping each term to the documents that contain it, optionally with postings (the positions of the term in a document)]
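A minimal positional inverted index might look like this (illustrative Java; real engines use compressed, on-disk structures):

    import java.util.*;

    // Maps each term to its postings: docId -> positions of the term in that doc.
    public class InvertedIndex {
        final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

        void add(int docId, String text) {
            String[] tokens = text.toLowerCase().split("\\W+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) continue;
                index.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);   // optional positional posting
            }
        }

        // The set of documents containing a term (the core lookup at query time).
        Set<Integer> docsContaining(String term) {
            return index.getOrDefault(term.toLowerCase(), Map.of()).keySet();
        }
    }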
15
Retrieval Models (1)
  • Set-theoretic IR models
  • Documents are represented by a set of terms.
  • Well-known set-theoretic models:
  • Boolean IR Model
  • Retrieval function is based on Boolean operations (AND, OR, NOT); see the sketch after this list
  • Query is formulated in Boolean logic
  • Fuzzy Set IR Model
  • Retrieval function is based on fuzzy set operations
  • Query is formulated in Boolean logic
  • Rough Set IR Model
  • Various set operations were examined.
  • Ad-hoc Boolean query
  • Probabilistic IR model
  • Mainly used for probabilistic index-term weighting
  • Provides a mathematical framework for the well-known tf-idf indexing scheme
  • Language-model based
  • Retrieval is framed as inferring the query from each document's language model.
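As a sketch of how a Boolean retrieval function can evaluate queries via set operations over posting sets (illustrative Java, building on the inverted index shown earlier):

    import java.util.*;

    public class BooleanRetrieval {
        // AND: documents containing both terms (intersection of postings).
        static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<>(a); r.retainAll(b); return r;
        }
        // OR: documents containing either term (union of postings).
        static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<>(a); r.addAll(b); return r;
        }
        // NOT: documents in the collection that lack the term (complement).
        static Set<Integer> not(Set<Integer> all, Set<Integer> a) {
            Set<Integer> r = new HashSet<>(all); r.removeAll(a); return r;
        }
    }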

16
Retrieval Models (2)
  • Vector space model
  • Queries and documents are represented as weighted vectors.
  • The basis vectors are called term vectors and are assumed to be semantically independent.
  • A document (or query) is represented as a linear combination of vectors in the generating set.
  • The retrieval function is based on the dot product or cosine measure between document and query vectors (see the sketch after this list).
  • Extended Boolean IR model
  • Combines characteristics of the vector space IR model with properties of Boolean algebra.
  • The retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured using p-norms, where 1 ≤ p ≤ ∞.
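A minimal sketch of tf-idf weighting and the cosine measure over sparse vectors (illustrative Java; w = tf · log(N/df) is the classic weight the tf-idf bullet above refers to):

    import java.util.*;

    public class VectorSpace {
        // Classic tf-idf weight for one term: tf * log(N / df).
        static double tfidf(int tf, int df, int nDocs) {
            return df == 0 ? 0 : tf * Math.log((double) nDocs / df);
        }

        // Cosine between sparse vectors (term -> weight); absent terms weigh 0.
        static double cosine(Map<String, Double> d, Map<String, Double> q) {
            double dot = 0;
            for (Map.Entry<String, Double> e : q.entrySet())
                dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
            double denom = norm(d) * norm(q);
            return denom == 0 ? 0 : dot / denom;
        }

        static double norm(Map<String, Double> v) {
            return Math.sqrt(v.values().stream().mapToDouble(w -> w * w).sum());
        }
    }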

17
The Retrieval Process
18
The Retrieval Process in an IR System
19
  • Introduction to Search Engines (IR Systems for
    the Web)

20
World Wide Web History
  • 1965: Hypertext
  • Ted Nelson developed the idea of hypertext in 1965.
  • Late 1960s
  • Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
  • Early 1970s
  • ARPANET was developed in the early 1970s.
  • 1982: Transmission Control Protocol (TCP) and Internet Protocol (IP)
  • 1989-90: WWW
  • Developed by Tim Berners-Lee and others at CERN to organize research documents available on the Internet.
  • Combined the idea of documents available by FTP with the idea of hypertext to link documents.
  • Developed the initial HTTP network protocol, URLs, HTML, and the first web server.

21
Search Engine (Web-based IR System) History
  • By late 1980s many files were available by
    anonymous FTP.
  • In 1990, Alan Emtage of McGill Univ. developed
    Archie (short for archives)
  • Assembled lists of files available on many FTP
    servers.
  • Allowed regular expression search of these file
    names.
  • In 1993, Veronica and Jughead were developed to
    search names of text files available through
    Gopher servers.
  • In 1993, early web robots (spiders) were built to
    collect URLs
  • Wanderer
  • ALIWEB (Archie-Like Index of the WEB)
  • WWW Worm (indexed URLs and titles for regex
    search)
  • In 1994, Stanford graduate students David Filo
    and Jerry Yang started manually collecting
    popular web sites into a topical hierarchy called
    Yahoo.

22
Search Engine History (contd)
  • In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Washington (it eventually became part of Excite and AOL).
  • A few months later, Michael "Fuzzy" Mauldin at CMU developed Lycos with his graduate students.
  • First to use a standard IR system as developed for the DARPA Tipster project.
  • First to index a large set of pages.
  • In late 1995, DEC developed AltaVista.
  • Used a large farm of Alpha machines to quickly process large numbers of queries.
  • Supported Boolean operators, phrases, and reverse-pointer queries.
  • In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford University.
  • Used link analysis to rank documents.

23
How Do Web Search Engines Work?
  • Search engines for the general web:
  • Search a database of the full text of web pages selected from billions of web pages.
  • Searching is based on inverted index entries.
  • Search engine databases:
  • Full-text documents are collected by software robots (also called softbots or spiders), which navigate the web to collect pages.
  • The web can be viewed as a graph structure.
  • Navigation can be based on DFS (Depth-First Search), BFS (Breadth-First Search), or some combined navigation heuristics (a BFS sketch follows this list).
  • How to detect cycles? → research issue
  • The indexer then builds inverted index entries and stores them in inverted files.
  • If necessary, the inverted files may be compressed.
  • Some types of pages and links are excluded from the search engine.
  • These form the invisible Web (maybe many times bigger than the visible Web).
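A minimal BFS crawler sketch in Java; the visited set doubles as the simplest cycle detector, and fetchOutlinks is a hypothetical helper that downloads a page and extracts its links:

    import java.util.*;

    public class BfsCrawler {
        void crawl(String seedUrl) {
            Deque<String> frontier = new ArrayDeque<>(List.of(seedUrl));
            Set<String> visited = new HashSet<>(List.of(seedUrl));
            while (!frontier.isEmpty()) {
                String url = frontier.removeFirst();      // FIFO queue -> breadth-first
                for (String link : fetchOutlinks(url)) {
                    if (visited.add(link))                // false if already seen: cycle avoided
                        frontier.addLast(link);
                }
            }
        }

        // Hypothetical helper: fetch the page at `url` and return its outgoing links.
        List<String> fetchOutlinks(String url) {
            return List.of();
        }
    }

Swapping the FIFO frontier for a LIFO stack turns this into the depth-first variant shown on the next slide.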

24
Breadth-First Crawling
25
Depth-First Crawling
26
Web Search Engine System Architecture
27
[Diagram: web search engine system architecture. A Robot collects pages from Internet websites into temporary storage; a Parser and a Stopper/Stemmer feed the Indexer, which builds inverted files (these can be based on different physical data structures); the Retrieval Mechanism matches logical document representations (based on IR models) against queries the User submits through the Interface.]
28
Distributed Architecture (example)
  • Harvest (http://harvest.sourceforge.net/)
  • A distributed web search engine
  • Distributes the load among different machines
  • The indexer doesn't run on the same machine as the broker or web server.

29
What Makes a SE Good?
  • Database of web documents
  • Size of the database
  • Freshness (recency, or up-to-dateness)
  • Types of documents offered
  • Retrieval speed
  • The search engine's capabilities
  • Search options
  • Effectiveness of the retrieval mechanism
  • Support for concept-based search → semantic web
  • Concept-based search systems try to determine what you mean, not just what you say.
  • Concept-based search often works better in theory than in practice; concept-based indexing is a difficult task to perform.
  • Presentation of the results
  • Keywords highlighted in context
  • Showing a summary of the web pages that match

30
  • Search Engine Example (Google)

31
Google
  • The most popular web search engine
  • Crawls the web (by robots) and stores a local cache of found pages
  • Builds a lexicon of common words
  • For each word, creates an index list of the pages containing it
  • Also uses human-compiled information from the Open Directory
  • Cached links let you see older versions of recently changed pages
  • Link analysis system
  • PageRank heuristic
  • Estimated size of index
  • 580 million pages visited and recorded
  • Uses link data to reach another 500 million pages (by the link analysis system)
  • Recent estimation is around 4 billion pages (??)
  • Index refresh
  • Updated monthly/weekly, or daily for popular pages
  • Serves queries from three data centres (service replication)
  • Service updates are synchronized.
  • Two on the West Coast of the US, one on the East Coast.

32
Google Founders
  • Larry Page, Co-founder and President, Products
  • Sergey Brin, Co-founder and President, Technology
  • PhD students at Stanford
  • Became a public company last year (2004)

33
Google Architecture Overview
34
Google Indexer
[Diagram: the indexer records term frequencies per document]
35
Google Lexicon
36
Google Searcher
37
Google Features
  • Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
  • Other services also use link popularity, but none to the extent that Google does.
  • Traditional IR (LITE)
  • Link popularity (HEAVILY USED)
  • Citation importance ranking (quality of the links pointing at a page)
  • Relevancy
  • Similarity between the query and a page
  • Number of links
  • Link quality
  • Link content
  • Ranking boosts on text styles
  • PageRank
  • Usage simulation + citation importance ranking
  • User randomly navigates
  • Process modelled by a Markov chain

38
Collecting Links in Google
  • Submission (by web promotion)
  • Add URL page (you may not need to do a "deep" submit)
  • The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
  • Crawling and index depth
  • Google aims to refresh its index on a monthly basis.
  • If Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks.
  • This text is associated with the pages the link points at, which makes it possible for Google to find matching pages even when those pages cannot themselves be indexed.

39
Google Guidelines for Web-submission
40
Deep SubmitPro
41
Link Analysis for Relevancy (1)
  • Inspired by CiteSeer (NEC Research Institute, Princeton, NJ) and the IBM Clever project
  • CiteSeer..
  • http://www.almaden.ibm.com/cs/k53/clever.html
  • Google ranks web pages based on the number, quality, and content of the links pointing at them (citations).
  • Number of links
  • All things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
  • Link quality
  • Numbers aren't everything: a single link from an important site might be worth more than many links from relatively unknown sites.
  • Weights page importance: links from important pages are weighted higher.

42
Link Analysis for Relevancy (2)
  • Link content
  • The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
  • Ranking boosts on text styles
  • The appearance of terms in bold text, in header text, or in a large font size is also taken into account. None of these are dominant factors, but they do figure into the overall equation.

43
PageRank
  • Usage simulation + citation importance ranking
  • Based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
  • The user navigates randomly:
  • Jumps to a random page with probability p
  • Follows a random hyperlink from the current page with probability 1 - p
  • Does not go back to a previously visited page by following a previously traversed link backwards
  • Intuitively, Google thereby finds a type of universally important page:
  • locations that are heavily visited in a random traversal of the Web's link structure.

44
PageRank Heuristics
  • The process is modelled by the following heuristics:
  • The probability of being at each page is computed; p is set by the system.
  • wj = PageRank of page j
  • ni = number of outgoing links on page i
  • m = the number of nodes in G (the number of web pages in the collection)
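The equation image on this slide did not survive extraction; with the definitions above, the standard PageRank recurrence it describes is

    w_j = p/m + (1 - p) * Σ_{i → j} ( w_i / n_i )

where the sum runs over all pages i that link to page j. The weights are computed by iterating this recurrence (power iteration on the underlying Markov chain) until they converge.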

45
PageRank Illustration
[Diagram: pages w1 ... wm each pass a share of their rank to page wj along their outgoing links, with each contribution damped by the factor (1 - p)]
46
Google Spamming
  • Google's link-popularity ranking system leaves it relatively immune to traditional spamming techniques.
  • It goes beyond the text on pages to decide how good they are: no links, low rank.
  • Common spam idea:
  • Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages out across a network of sites.
  • "The (Evil) Genius of Comment Spammers" by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7

47
http://www.wired.com/wired/archive/12.03/google.html?pg=7
48
Topic Search: http://www.google.com/options/index.html
49
Brief Introduction to Semantic Web
50
Machine Process-able Knowledge on the Web
  • Unique identity of resources and objects: URI
  • Metadata Annotations
  • Data describing the content and meaning of
    resources
  • But everyone must speak the same language
  • Terminologies
  • Shared and common vocabularies
  • But everyone must mean the same thing
  • Ontologies
  • Shared and common understanding of a domain
  • Essential for exchange and discovery of knowledge
  • Inference
  • Apply the knowledge in the metadata and the
    ontology to create new metadata and new knowledge

51
The Semantic Web
52
Ontologies The Semantic Backbone
53
Language Tower in Semantic Web
Web Ontology Language 1.0 Reference: http://www.w3.org/TR/owl-ref/
[Diagram: the Semantic Web language tower, from top to bottom: Attribution; Explanation; Rules and Inference; Ontologies; Metadata annotations; Standard Syntax; Identity]
54
[Diagram: an example ontology. Concepts include Person, Sport, Team-based Sport (participants > 1), Soccer, Sports Club, Soccer Club, Sport Club, Organisation, and Country; instances include Blackburn Rovers (a Soccer Club) and Blackburn; the UK is partof Europe.]
55
[Diagram: an example ontology. Event, Competition, Tournament, Sports Tournament, Soccer Tournament (e.g., the Worthington Cup); Sports Player, Soccer Player (e.g., Andy Cole and Brad Friedel of Blackburn Rovers).]
56
[Diagram: instance metadata. Andy Cole is a Soccer Player (a Sports Player, a Person) at Blackburn Rovers; birthplace Nottingham; nationality UK; the UK is partof Europe and is a Country.]
57
[Diagram: instance metadata. Brad Friedel is a Soccer Player at Blackburn Rovers; birthplace Lakewood, USA; nationality USA; the UK (partof Europe) and the USA are Countries.]
58
  • Useful IR System Building Software
  • And Resources

59
Lucene API (http://lucene.apache.org/)
  • Pure Java (data abstraction, platform independence, reusable components)
  • High-performance indexing
  • Supports both incremental indexing and batch indexing
  • Provides accurate and efficient search mechanisms
  • Complex queries based on Boolean and phrase queries, and queries against specific document fields
  • Ranked searching: the highest-scoring results are returned first
  • Lets users develop a variety of new applications:
  • Searchable email
  • CD-based documentation search
  • DBMS object ID management
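A minimal indexing-and-search sketch against a recent Lucene release (the 2005-era 1.x API differed in detail, so treat this as illustrative rather than as the API the slides used):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("index"));
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Incremental indexing: each addDocument call adds one document.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body",
                        "introduction to modern information retrieval", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Ranked search: highest-scoring documents come back first.
            try (IndexReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer)
                        .parse("information AND retrieval");   // Boolean query syntax
                TopDocs hits = searcher.search(query, 10);
                System.out.println("matches: " + hits.totalHits);
            }
        }
    }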

60
http://www.getopt.org/luke/
61
www.egothor.org (supports EBIR)
62
http://nltk.sourceforge.net/
63
http://ciir.cs.umass.edu/research/indri/
64
http://www.summarization.com/
65
http://wordnet.princeton.edu/
66
http://protege.stanford.edu/
67
http://www.google.com/apis/
68
http://www.amazon.com/gp/browse.html/103-1065429-7111805?%5Fencoding=UTF8&node=3435361
Then click "Alexa Web Information Service 1.0 Released"
69
http://mg4j.dsi.unimi.it/
70
http://www.xapian.org/history.php (probabilistic IR model)
71
http://www.searchtools.com/info/info-retrieval.html
IR research resources
72
http://www-db.stanford.edu/db_pages/projects.html
73
http://dbpubs.stanford.edu:8090/aux/index-en.html
74
http://citeseer.ist.psu.edu/
75
  • Web Challenges
  • for IR Research Community

76
Research Issues (1)
  • The IR research field is interdisciplinary in nature.
  • Traditionally focused on retrieval effectiveness:
  • Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models: the INDRI system at UMass)
  • Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesauri, document categorization/clustering through machine learning techniques as knowledge acquisition tools)
  • Knowledge/semantically richer retrieval approaches (e.g., RUBRIC rule-based IR; some recent concept-based IR based on rules)
  • Information filtering based on user profiling
  • Traditionally based on small sets of text collections
  • Little work has been done on retrieval efficiency, although there are some reports (e.g., use of parallel architectures for handling index files based on signature files, etc.)

77
Research Issues (2)
  • Challenges:
  • Distributed data: documents spread over millions of different web servers.
  • Volatile data: many documents change or disappear rapidly (e.g., dead links) → information recency (up-to-dateness)
  • Large volume: billions of separate documents.
  • Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
  • Quality of data: no editorial control, false information, poor-quality writing, typos, etc.
  • Need for large-scale knowledge/semantically rich retrieval applications
  • Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.

78
Research Issues (3)
  • Retrieval effectiveness (all at large scale) with efficiency in mind
  • Test the effectiveness of IR models with efficiency as an important consideration
  • Effective and efficient indexing for both documents and queries
  • Natural language processing (some statistical)
  • Distributed incremental indexing
  • System and physical data structure/algorithm issues
  • Distributed brokering architecture for information recency
  • Investigation of semantically richer approaches
  • Semantic web and other rule-based approaches
  • Effective and efficient knowledge indexing
  • Use of users' relevance feedback
  • Automatic feedback acquisition
  • User profiling and information filtering
  • Evaluation measures (predictable)
  • Text summarization for better presentation
  • Text categorization (clustering) for topic search
  • (e.g., Yahoo subject directory, Google topics)

79
Research Issues (4)
  • Multimedia indexing
  • IBM QBIC project (http://wwwqbic.almaden.ibm.com/)
  • Indexing tools for various media types (e.g., an image of a mountain with a lake, covered by snow; SemCap)
  • Develop test beds for controllable experiments:
  • Internet emulator/simulator
  • Distributed IR subsystems
  • Appropriate performance measures (e.g., RB Precision)
  • Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency.