
1
http://comet.lehman.cuny.edu/jung/presentation/presentation.html
  • Introduction to Modern Information Retrieval and
    Search Engines
  • And Some Research Issues
  • Professor Gwang Jung
  • Department of Mathematics
  • and Computer Science
  • Lehman College, CUNY
  • November 10, Fall 05

2
Outline
  • Introduction to Information Retrieval
  • Introduction to Search Engines (IR Systems for
    the Web)
  • Search Engine Example: Google
  • Brief Introduction to Semantic Web
  • Useful Tools for IR System Building and
    Resources for Advanced Research
  • Research Issues

3
  • Introduction to Information Retrieval

4
Information Age
5
IR in General
  • Information Retrieval in general deals with:
  • Retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
  • A user query may be:
  • Structured (e.g., a Boolean expression of keywords or terms)
  • Unstructured (e.g., terms, a sentence, a document)
  • In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
  • Efficiency with respect to: algorithms, query processing, data organization/structure
  • Effectiveness with respect to: retrieval results

6
IR Systems
7
Formal Definition of IR System
  • IRS = (T, D, Q, F, R)
  • T: set of index terms (terms)
  • D: set of documents in a document database
  • Q: set of user queries
  • F: D x Q → R (retrieval function)
  • R: real numbers (RSV = Retrieval Status Value)
  • Relevance judgment is given by users.
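A minimal sketch of this formal model in Java (the type names are illustrative, not from the slides): F maps each (document, query) pair to a real-valued RSV, and results are ranked by descending RSV.

    import java.util.Set;

    // Illustrative types: a document d in D and a query q in Q,
    // both represented over the index-term set T.
    record Document(String id, Set<String> terms) {}
    record Query(Set<String> terms) {}

    // F : D x Q -> R, assigning each pair a Retrieval Status Value (RSV).
    interface RetrievalFunction {
        double rsv(Document d, Query q);
    }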

8
IRS versus DBMS
9
IR Systems Focus on Retrieval Effectiveness
  • The effective retrieval of relevant information depends on:
  • User task (formulating an effective query for the information need)
  • Indexing
  • IR systems in general adopt index terms to represent documents and queries.
  • Indexing is the process of developing document representations by assigning index terms to documents (information items).
  • Retrieval model (often called IR model) and logical view of documents
  • The logical view of documents (their logical representation) depends on the IR model.

10
Indexing
  • The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).
  • Descriptors = index terms = terms
  • Descriptors also lead users to participate in formulating information requests.
  • Two types of index terms:
  • Objective: author name, publisher, date of publication
  • Subjective: keywords selected from the full text
  • Two types of indexing methods:
  • Manual: performed by human experts (for very effective IR systems); may use an ontology
  • Automatic: performed by computer hardware and software

11
Indexing Aims (1)
  • Recall: the proportion of relevant items (documents) retrieved.
  • R = # of relevant items retrieved / total # of relevant items in the db
  • Precision: the proportion of retrieved documents that are relevant.
  • P = # of relevant items retrieved / total # of items retrieved (both measures are computed in the sketch after this list)
  • Effectiveness of indexing is mainly controlled by term specificity:
  • Broader terms may retrieve both useful (relevant) and useless (non-relevant) info items for the user.
  • Narrower (specific) index terms favor precision at the expense of recall.
  • Index Language (the set of well-selected index terms T):
  • Pre-specified (controlled): easy maintenance, poor adaptability
  • Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from users' queries.
  • Synonymous terms can be added to T via a thesaurus, e-dictionary (e.g., WordNet), and/or knowledge base (e.g., an ontology).
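As a quick sketch, both measures can be computed directly from the retrieved and relevant document-id sets (illustrative Java, not from the slides):

    import java.util.*;

    public class Effectiveness {
        // Recall = |retrieved ∩ relevant| / |relevant|
        static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
            return relevant.isEmpty() ? 0
                : (double) intersect(retrieved, relevant).size() / relevant.size();
        }

        // Precision = |retrieved ∩ relevant| / |retrieved|
        static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
            return retrieved.isEmpty() ? 0
                : (double) intersect(retrieved, relevant).size() / retrieved.size();
        }

        static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<>(a);
            r.retainAll(b);   // the relevant items that were actually retrieved
            return r;
        }
    }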

12
Indexing Aims (2)
  • Recall and Precision values vary from 0 to 1.
  • Average users want to have high recall and high
    precision.
  • In practice, a compromise must be reached (middle
    point).

[Figure: recall (R) vs. precision (P) tradeoff curve; both axes range from 0 to 1.0]
13
Steps for Indexing
  • Objective attributes of a document are extracted (e.g., title, author, URL, structure).
  • Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and, etc.).
  • Case folding might be performed.
  • Stemming might be used.
  • Frequencies of non-function words are used to specify term importance.
  • Term-frequency weighting fulfils only one of the indexing aims, i.e., recall.
  • Terms that occur rarely in the document database may be used to distinguish the documents in which they occur from those in which they do not occur → could improve precision.
  • Document frequency: the number of documents in the collection in which a term tj ∈ T occurs. (A minimal sketch of these steps follows.)
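A minimal sketch of these steps in Java, assuming a toy stop-word list (stemming omitted for brevity):

    import java.util.*;

    public class SimpleIndexer {
        // Toy stop-word list; real systems use a much larger one.
        static final Set<String> STOP = Set.of("of", "then", "this", "and", "the", "a");

        // Term frequencies for one document: lowercase, split on
        // non-word characters, drop stop words, count occurrences.
        static Map<String, Integer> termFrequencies(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String tok : text.toLowerCase().split("\\W+")) {
                if (!tok.isEmpty() && !STOP.contains(tok))
                    tf.merge(tok, 1, Integer::sum);
            }
            return tf;
        }

        // Document frequency: in how many documents does each term occur?
        static Map<String, Integer> documentFrequencies(List<String> docs) {
            Map<String, Integer> df = new HashMap<>();
            for (String doc : docs)
                for (String term : termFrequencies(doc).keySet())
                    df.merge(term, 1, Integer::sum);
            return df;
        }
    }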

14
Inverted Index File
[Diagram: inverted index entries mapping each term to the documents that contain it, optionally with postings (the positions of the term in a document)]
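A minimal positional inverted index might look like this (illustrative Java; real engines use compressed, on-disk structures):

    import java.util.*;

    // Maps each term to its postings: docId -> positions of the term in that doc.
    public class InvertedIndex {
        final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

        void add(int docId, String text) {
            String[] tokens = text.toLowerCase().split("\\W+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) continue;
                index.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);   // optional positional posting
            }
        }

        // The set of documents containing a term (the core lookup at query time).
        Set<Integer> docsContaining(String term) {
            return index.getOrDefault(term.toLowerCase(), Map.of()).keySet();
        }
    }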
15
Retrieval Models (1)
  • Set-theoretic IR models
  • Documents are represented by a set of terms.
  • Well-known set-theoretic models:
  • Boolean IR Model
  • Retrieval function is based on Boolean operations (AND, OR, NOT); see the sketch after this list
  • Query is formulated in Boolean logic
  • Fuzzy Set IR Model
  • Retrieval function is based on fuzzy set operations
  • Query is formulated in Boolean logic
  • Rough Set IR Model
  • Various set operations were examined.
  • Ad-hoc Boolean query
  • Probabilistic IR model
  • Mainly used for probabilistic index-term weighting
  • Provides a mathematical framework for the well-known tf-idf indexing scheme
  • Language-model based
  • Retrieval is framed as inferring the query from each document's language model.
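As a sketch of how a Boolean retrieval function can evaluate queries via set operations over posting sets (illustrative Java, building on the inverted index shown earlier):

    import java.util.*;

    public class BooleanRetrieval {
        // AND: documents containing both terms (intersection of postings).
        static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<>(a); r.retainAll(b); return r;
        }
        // OR: documents containing either term (union of postings).
        static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<>(a); r.addAll(b); return r;
        }
        // NOT: documents in the collection that lack the term (complement).
        static Set<Integer> not(Set<Integer> all, Set<Integer> a) {
            Set<Integer> r = new HashSet<>(all); r.removeAll(a); return r;
        }
    }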

16
Retrieval Models (2)
  • Vector space model
  • Queries and documents are represented as weighted vectors.
  • The basis vectors are called term vectors and are assumed to be semantically independent.
  • A document (or query) is represented as a linear combination of vectors in the generating set.
  • The retrieval function is based on the dot product or cosine measure between document and query vectors (see the sketch after this list).
  • Extended Boolean IR model
  • Combines characteristics of the vector space IR model with properties of Boolean algebra.
  • The retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured using p-norms, where 1 ≤ p ≤ ∞.
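A minimal sketch of tf-idf weighting and the cosine measure over sparse vectors (illustrative Java; w = tf · log(N/df) is the classic weight the tf-idf bullet above refers to):

    import java.util.*;

    public class VectorSpace {
        // Classic tf-idf weight for one term: tf * log(N / df).
        static double tfidf(int tf, int df, int nDocs) {
            return df == 0 ? 0 : tf * Math.log((double) nDocs / df);
        }

        // Cosine between sparse vectors (term -> weight); absent terms weigh 0.
        static double cosine(Map<String, Double> d, Map<String, Double> q) {
            double dot = 0;
            for (Map.Entry<String, Double> e : q.entrySet())
                dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
            double denom = norm(d) * norm(q);
            return denom == 0 ? 0 : dot / denom;
        }

        static double norm(Map<String, Double> v) {
            return Math.sqrt(v.values().stream().mapToDouble(w -> w * w).sum());
        }
    }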

17
The Retrieval Process
18
The Retrieval Process in an IR System
19
  • Introduction to Search Engines (IR Systems for
    the Web)

20
World Wide Web History
  • 1965: Hypertext
  • Ted Nelson developed the idea of hypertext in 1965.
  • Late 1960s
  • Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
  • Early 1970s
  • ARPANET was developed in the early 1970s.
  • 1982: Transmission Control Protocol (TCP) and Internet Protocol (IP)
  • 1989-90: WWW
  • Developed by Tim Berners-Lee and others at CERN to organize research documents available on the Internet.
  • Combined the idea of documents available by FTP with the idea of hypertext to link documents.
  • Developed the initial HTTP network protocol, URLs, HTML, and the first web server.

21
Search Engine (Web-based IR System) History
  • By late 1980s many files were available by
    anonymous FTP.
  • In 1990, Alan Emtage of McGill Univ. developed
    Archie (short for archives)
  • Assembled lists of files available on many FTP
    servers.
  • Allowed regular expression search of these file
    names.
  • In 1993, Veronica and Jughead were developed to
    search names of text files available through
    Gopher servers.
  • In 1993, early web robots (spiders) were built to
    collect URLs
  • Wanderer
  • ALIWEB (Archie-Like Index of the WEB)
  • WWW Worm (indexed URLs and titles for regex
    search)
  • In 1994, Stanford graduate students David Filo
    and Jerry Yang started manually collecting
    popular web sites into a topical hierarchy called
    Yahoo.

22
Search Engine History (contd)
  • In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Washington (it eventually became part of Excite and AOL).
  • A few months later, Michael "Fuzzy" Mauldin at CMU developed Lycos with his graduate students.
  • First to use a standard IR system as developed for the DARPA Tipster project.
  • First to index a large set of pages.
  • In late 1995, DEC developed AltaVista.
  • Used a large farm of Alpha machines to quickly process large numbers of queries.
  • Supported Boolean operators, phrases, and reverse-pointer queries.
  • In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford University.
  • Used link analysis to rank documents.

23
How Do Web Search Engines Work?
  • Search engines for the general web:
  • Search a database of the full text of web pages selected from billions of web pages.
  • Searching is based on inverted index entries.
  • Search engine databases:
  • Full-text documents are collected by software robots (also called softbots or spiders), which navigate the web to collect pages.
  • The web can be viewed as a graph structure.
  • Navigation can be based on DFS (Depth-First Search), BFS (Breadth-First Search), or some combined navigation heuristics (a BFS sketch follows this list).
  • How to detect cycles? → research issue
  • The indexer then builds inverted index entries and stores them in inverted files.
  • If necessary, the inverted files may be compressed.
  • Some types of pages and links are excluded from the search engine.
  • These form the invisible Web (maybe many times bigger than the visible Web).
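A minimal BFS crawler sketch in Java; the visited set doubles as the simplest cycle detector, and fetchOutlinks is a hypothetical helper that downloads a page and extracts its links:

    import java.util.*;

    public class BfsCrawler {
        void crawl(String seedUrl) {
            Deque<String> frontier = new ArrayDeque<>(List.of(seedUrl));
            Set<String> visited = new HashSet<>(List.of(seedUrl));
            while (!frontier.isEmpty()) {
                String url = frontier.removeFirst();      // FIFO queue -> breadth-first
                for (String link : fetchOutlinks(url)) {
                    if (visited.add(link))                // false if already seen: cycle avoided
                        frontier.addLast(link);
                }
            }
        }

        // Hypothetical helper: fetch the page at `url` and return its outgoing links.
        List<String> fetchOutlinks(String url) {
            return List.of();
        }
    }

Swapping the FIFO frontier for a LIFO stack turns this into the depth-first variant shown on the next slide.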

24
Breadth-First Crawling
25
Depth-First Crawling
26
Web Search Engine System Architecture
27
[Diagram: web search engine system architecture. A Robot collects pages from Internet websites into temporary storage; a Parser and a Stopper/Stemmer feed the Indexer, which builds inverted files (these can be based on different physical data structures); the Retrieval Mechanism matches logical document representations (based on IR models) against queries the User submits through the Interface.]
28
Distributed Architecture (example)
  • Harvest (http://harvest.sourceforge.net/)
  • A distributed web search engine
  • Distributes the load among different machines
  • The indexer doesn't run on the same machine as the broker or web server.

29
What Makes a SE Good?
  • Database of web documents
  • Size of the database
  • Freshness (recency, or up-to-dateness)
  • Types of documents offered
  • Retrieval speed
  • The search engine's capabilities
  • Search options
  • Effectiveness of the retrieval mechanism
  • Support for concept-based search → semantic web
  • Concept-based search systems try to determine what you mean, not just what you say.
  • Concept-based search often works better in theory than in practice; concept-based indexing is a difficult task to perform.
  • Presentation of the results
  • Keywords highlighted in context
  • Showing a summary of the web pages that match

30
  • Search Engine Example (Google)

31
Google
  • The most popular web search engine
  • Crawls the web (by robots) and stores a local cache of found pages
  • Builds a lexicon of common words
  • For each word, creates an index list of the pages containing it
  • Also uses human-compiled information from the Open Directory
  • Cached links let you see older versions of recently changed pages
  • Link analysis system
  • PageRank heuristic
  • Estimated size of index
  • 580 million pages visited and recorded
  • Uses link data to reach another 500 million pages (by the link analysis system)
  • Recent estimation is around 4 billion pages (??)
  • Index refresh
  • Updated monthly/weekly, or daily for popular pages
  • Serves queries from three data centres (service replication)
  • Service updates are synchronized.
  • Two on the West Coast of the US, one on the East Coast.

32
Google Founders
  • Larry Page, Co-founder and President, Products
  • Sergey Brin, Co-founder and President, Technology
  • PhD students at Stanford
  • Became a public company last year (2004)

33
Google Architecture Overview
34
Google Indexer
[Diagram: the indexer records term frequencies per document]
35
Google Lexicon
36
Google Searcher
37
Google Features
  • Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
  • Other services also use link popularity, but none to the extent that Google does.
  • Traditional IR (LITE)
  • Link popularity (HEAVILY USED)
  • Citation importance ranking (quality of the links pointing at a page)
  • Relevancy
  • Similarity between the query and a page
  • Number of links
  • Link quality
  • Link content
  • Ranking boosts on text styles
  • PageRank
  • Usage simulation + citation importance ranking
  • User randomly navigates
  • Process modelled by a Markov chain

38
Collecting Links in Google
  • Submission (by web promotion)
  • Add URL page (you may not need to do a "deep" submit)
  • The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
  • Crawling and index depth
  • Google aims to refresh its index on a monthly basis.
  • If Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks.
  • This text is associated with the pages the link points at, which makes it possible for Google to find matching pages even when those pages cannot themselves be indexed.

39
Google Guidelines for Web-submission
40
Deep SubmitPro
41
Link Analysis for Relevancy (1)
  • Inspired by CiteSeer (NEC Research Institute, Princeton, NJ) and the IBM Clever project
  • CiteSeer..
  • http://www.almaden.ibm.com/cs/k53/clever.html
  • Google ranks web pages based on the number, quality, and content of the links pointing at them (citations).
  • Number of links
  • All things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
  • Link quality
  • Numbers aren't everything: a single link from an important site might be worth more than many links from relatively unknown sites.
  • Weights page importance: links from important pages are weighted higher.

42
Link Analysis for Relevancy (2)
  • Link content
  • The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
  • Ranking boosts on text styles
  • The appearance of terms in bold text, in header text, or in a large font size is also taken into account. None of these are dominant factors, but they do figure into the overall equation.

43
PageRank
  • Usage simulation + citation importance ranking
  • Based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
  • The user navigates randomly:
  • Jumps to a random page with probability p
  • Follows a random hyperlink from the current page with probability 1 - p
  • Does not go back to a previously visited page by following a previously traversed link backwards
  • Intuitively, Google thereby finds a type of universally important page:
  • locations that are heavily visited in a random traversal of the Web's link structure.

44
PageRank Heuristics
  • The process is modelled by the following heuristics:
  • The probability of being at each page is computed; p is set by the system.
  • wj = PageRank of page j
  • ni = number of outgoing links on page i
  • m = the number of nodes in G (the number of web pages in the collection)
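The equation image on this slide did not survive extraction; with the definitions above, the standard PageRank recurrence it describes is

    w_j = p/m + (1 - p) * Σ_{i → j} ( w_i / n_i )

where the sum runs over all pages i that link to page j. The weights are computed by iterating this recurrence (power iteration on the underlying Markov chain) until they converge.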

45
PageRank Illustration
[Diagram: pages w1 ... wm each pass a share of their rank to page wj along their outgoing links, with each contribution damped by the factor (1 - p)]
46
Google Spamming
  • Google's link-popularity ranking system leaves it relatively immune to traditional spamming techniques.
  • It goes beyond the text on pages to decide how good they are: no links, low rank.
  • Common spam idea:
  • Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages out across a network of sites.
  • "The (Evil) Genius of Comment Spammers" by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7

47
http://www.wired.com/wired/archive/12.03/google.html?pg=7
48
Topic Search: http://www.google.com/options/index.html
49
Brief Introduction to Semantic Web
50
Machine Process-able Knowledge on the Web
  • Unique identity of resources and objects: URI
  • Metadata Annotations
  • Data describing the content and meaning of
    resources
  • But everyone must speak the same language
  • Terminologies
  • Shared and common vocabularies
  • But everyone must mean the same thing
  • Ontologies
  • Shared and common understanding of a domain
  • Essential for exchange and discovery of knowledge
  • Inference
  • Apply the knowledge in the metadata and the
    ontology to create new metadata and new knowledge

51
The Semantic Web
52
Ontologies The Semantic Backbone
53
Language Tower in Semantic Web
Web Ontology Language 1.0 Reference: http://www.w3.org/TR/owl-ref/
[Diagram: the Semantic Web language tower, from top to bottom: Attribution; Explanation; Rules and Inference; Ontologies; Metadata annotations; Standard Syntax; Identity]
54
[Diagram: an example ontology. Concepts include Person, Sport, Team-based Sport (participants > 1), Soccer, Sports Club, Soccer Club, Sport Club, Organisation, and Country; instances include Blackburn Rovers (a Soccer Club) and Blackburn; the UK is partof Europe.]
55
[Diagram: an example ontology. Event, Competition, Tournament, Sports Tournament, Soccer Tournament (e.g., the Worthington Cup); Sports Player, Soccer Player (e.g., Andy Cole and Brad Friedel of Blackburn Rovers).]
56
[Diagram: instance metadata. Andy Cole is a Soccer Player (a Sports Player, a Person) at Blackburn Rovers; birthplace Nottingham; nationality UK; the UK is partof Europe and is a Country.]
57
[Diagram: instance metadata. Brad Friedel is a Soccer Player at Blackburn Rovers; birthplace Lakewood, USA; nationality USA; the UK (partof Europe) and the USA are Countries.]
58
  • Useful IR System Building Software
  • And Resources

59
Lucene API (http://lucene.apache.org/)
  • Pure Java (data abstraction, platform independence, reusable components)
  • High-performance indexing
  • Supports both incremental indexing and batch indexing
  • Provides accurate and efficient search mechanisms
  • Complex queries based on Boolean and phrase queries, and queries against specific document fields
  • Ranked searching: the highest-scoring results are returned first
  • Lets users develop a variety of new applications:
  • Searchable email
  • CD-based documentation search
  • DBMS object ID management
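A minimal indexing-and-search sketch against a recent Lucene release (the 2005-era 1.x API differed in detail, so treat this as illustrative rather than as the API the slides used):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("index"));
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Incremental indexing: each addDocument call adds one document.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body",
                        "introduction to modern information retrieval", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Ranked search: highest-scoring documents come back first.
            try (IndexReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer)
                        .parse("information AND retrieval");   // Boolean query syntax
                TopDocs hits = searcher.search(query, 10);
                System.out.println("matches: " + hits.totalHits);
            }
        }
    }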

60
http://www.getopt.org/luke/
61
www.egothor.org (supports EBIR)
62
http://nltk.sourceforge.net/
63
http://ciir.cs.umass.edu/research/indri/
64
http://www.summarization.com/
65
http://wordnet.princeton.edu/
66
http://protege.stanford.edu/
67
http://www.google.com/apis/
68
http://www.amazon.com/gp/browse.html/103-1065429-7111805?%5Fencoding=UTF8&node=3435361
Then click "Alexa Web Information Service 1.0 Released"
69
http://mg4j.dsi.unimi.it/
70
http://www.xapian.org/history.php (probabilistic IR model)
71
http://www.searchtools.com/info/info-retrieval.html
IR research resources
72
http://www-db.stanford.edu/db_pages/projects.html
73
http://dbpubs.stanford.edu:8090/aux/index-en.html
74
http://citeseer.ist.psu.edu/
75
  • Web Challenges
  • for IR Research Community

76
Research Issues (1)
  • The IR research field is interdisciplinary in nature.
  • Traditionally focused on retrieval effectiveness:
  • Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models: the INDRI system at UMass)
  • Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesauri, document categorization/clustering through machine learning techniques as knowledge acquisition tools)
  • Knowledge/semantically richer retrieval approaches (e.g., RUBRIC rule-based IR; some recent concept-based IR based on rules)
  • Information filtering based on user profiling
  • Traditionally based on small sets of text collections
  • Little work has been done on retrieval efficiency, although there are some reports (e.g., use of parallel architectures for handling index files based on signature files, etc.)

77
Research Issues (2)
  • Challenges:
  • Distributed data: documents spread over millions of different web servers.
  • Volatile data: many documents change or disappear rapidly (e.g., dead links) → information recency (up-to-dateness)
  • Large volume: billions of separate documents.
  • Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
  • Quality of data: no editorial control, false information, poor-quality writing, typos, etc.
  • Need for large-scale knowledge/semantically rich retrieval applications
  • Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.

78
Research Issues (3)
  • Retrieval effectiveness (all at large scale) with efficiency in mind
  • Test the effectiveness of IR models with efficiency as an important consideration
  • Effective and efficient indexing for both documents and queries
  • Natural language processing (some statistical)
  • Distributed incremental indexing
  • System and physical data structure/algorithm issues
  • Distributed brokering architecture for information recency
  • Investigation of semantically richer approaches
  • Semantic web and other rule-based approaches
  • Effective and efficient knowledge indexing
  • Use of users' relevance feedback
  • Automatic feedback acquisition
  • User profiling and information filtering
  • Evaluation measures (predictable)
  • Text summarization for better presentation
  • Text categorization (clustering) for topic search
  • (e.g., Yahoo subject directory, Google topics)

79
Research Issues (4)
  • Multimedia indexing
  • IBM QBIC project (http://wwwqbic.almaden.ibm.com/)
  • Indexing tools for various media types (e.g., an image of a mountain with a lake, covered by snow; SemCap)
  • Develop test beds for controllable experiments:
  • Internet emulator/simulator
  • Distributed IR subsystems
  • Appropriate performance measures (e.g., RB Precision)
  • Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency.