Intelligent Information Retrieval (and Web Search)

1
Intelligent Information Retrieval (and Web Search)
  • Professor Celso A A Kaestner, PhD.
  • Brazil

2
Site: www.dainf.ct.utfpr.edu.br/kaestner/Konstanz/iir.htm
3
Introduction
4
Introduction: Information Retrieval
  • IR: representation, storage, organization of, and
    access to information items
  • Focus is on the user information need
  • Example of a user information need:
  • Find all docs containing information on college
    football teams which (1) are maintained by a
    university and (2) participate in the national
    tournament.
  • Emphasis is on the retrieval of information (not
    data).

5
Data Retrieval vs. Information Retrieval
  • Data Retrieval:
  • which docs contain a set of keywords?
  • well-defined semantics
  • a single erroneous object implies failure!
  • Information Retrieval (IR):
  • information about a subject or topic
  • semantics is frequently loose
  • small errors are tolerated.
  • IR system:
  • interprets contents of information items
  • generates a ranking which reflects relevance
  • notion of relevance is most important.

6
Information Retrieval (IR)
  • The indexing and retrieval of textual documents.
  • Searching for pages on the World Wide Web is the
    most recent killer app.
  • Concerned firstly with retrieving documents
    relevant to a query.
  • Concerned secondly with retrieving from large
    sets of documents efficiently.

7
Typical IR Task
  • Given:
  • A corpus of textual natural-language documents.
  • A user query in the form of a textual string.
  • Find:
  • A ranked set of documents that are relevant to
    the query.

8
IR System
[Figure: IR System]
9
Relevance
  • Relevance is a subjective judgment and may
    include
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her
    intended use of the information (information
    need).

10
Keyword Search
  • Simplest notion of relevance is that the query
    string appears verbatim in the document.
  • Slightly less strict notion is that the words in
    the query appear frequently in the document, in
    any order (bag of words).
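
The two notions of relevance above can be compared with a minimal Python sketch. The function names, the tokenizer, and the sample documents are illustrative assumptions, not part of the original slides; the bag-of-words score here is simply the total count of query words in the document.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))


# Hypothetical sample documents, for illustration only.
docs = [
    "College football teams play in the national tournament.",
    "The football coach said the teams from each college train hard.",
    "A recipe for apple pie.",
]
query = "college football teams"

verbatim_hits = [verbatim_match(query, d) for d in docs]
scores = [bag_of_words_score(query, d) for d in docs]
```

In this toy data the second document fails the verbatim test but receives the same bag-of-words score as the first, because all three query words occur in it in a different order.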

11
Problems with Keywords
  • May not retrieve relevant documents that include
    synonymous terms.
  • restaurant vs. café
  • PRC vs. China
  • May retrieve irrelevant documents that include
    ambiguous terms.
  • bat (baseball vs. mammal)
  • Apple (company vs. fruit)
  • bit (unit of data vs. act of eating)

12
Beyond Keywords
  • We will cover the basics of keyword-based IR,
    but
  • We will focus on extensions and recent
    developments that go beyond keywords.
  • We will cover the basics of building an efficient
    IR system, but
  • We will focus on basic capabilities and
    algorithms rather than systems issues that allow
    scaling to industrial size databases.

13
Intelligent IR
  • Taking into account the meaning of the words
    used.
  • Taking into account the order of words in the
    query.
  • Adapting to the user based on direct or indirect
    feedback.
  • Taking into account the authority of the source.

14
IR System Architecture
[Figure: IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Labeled flows: User Need, Text, Logical View, Query, User Feedback, Retrieved Docs, Ranked Docs.]
15
IR System Components
  • Text Operations: form index words (tokens).
  • Standardization (e.g., capitalization)
  • Stopword removal
  • Stemming
  • Indexing: constructs an inverted index of
    word-to-document pointers.
  • Searching: retrieves documents that contain a
    given query token from the inverted index.
  • Ranking: scores all retrieved documents according
    to a relevance metric.
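
The four components listed above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions: the stopword list is a tiny hand-picked set, `stem` is a crude suffix stripper standing in for a real stemmer, and the relevance metric is just summed term frequency. None of these choices come from the slides.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Standardize case, remove stopwords, and stem to form index tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Indexing: inverted index mapping token -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for token, freq in Counter(text_operations(text)).items():
            index[token][doc_id] = freq
    return index


def search(index, query):
    """Searching and ranking: retrieve docs containing any query token,
    then sort by summed term frequency (ties broken by doc_id)."""
    scores = Counter()
    for token in text_operations(query):
        for doc_id, freq in index.get(token, {}).items():
            scores[doc_id] += freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample documents, for illustration only.
docs = [
    "Indexing constructs an inverted index of documents.",
    "Ranking scores the retrieved documents.",
    "Stemming and stopword removal are text operations.",
]
index = build_index(docs)
results = search(index, "ranking documents")
```

Because the query passes through the same text operations as the documents, "ranking" and "documents" are matched as the tokens "rank" and "document", and document 1 ranks first since it contains both.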

16
IR System Components (continued)
  • User Interface: manages interaction with the user
  • Query input and document output.
  • Relevance feedback.
  • Visualization of results.
  • Query Operations: transform the query to improve
    retrieval
  • Query expansion using a thesaurus.
  • Query transformation using relevance feedback.
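
Query expansion using a thesaurus can be illustrated with a short Python sketch. The `THESAURUS` table and the `expand_query` function are hypothetical; the two entries reuse the synonym pairs mentioned on the "Problems with Keywords" slide, and a real system would draw on a much larger resource.

```python
# Hypothetical, hand-built thesaurus: query word -> list of synonyms.
THESAURUS = {
    "restaurant": ["cafe"],
    "prc": ["china"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query words plus their thesaurus synonyms, without duplicates."""
    expanded = []
    for word in query.lower().split():
        for term in [word] + thesaurus.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded


expanded = expand_query("PRC restaurant guide")
```

The expanded term list can then be passed to the search step in place of the original query, so that documents using only the synonymous term are still retrieved.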

17
IR and the Web
  • IR at the center of the stage
  • The advent of the Web changed this perception once
    and for all:
  • universal repository of knowledge
  • free (low-cost) universal access
  • no central editorial board
  • many problems, though: IR is seen as key to finding
    the solutions!

18
IR and the Web
  • And more:
  • Most human tasks involve the processing of
    information in textual and/or graphic form
    (Lyman, 2003)
  • How Much Information project (Berkeley):
  • www.sims.berkeley.edu/how-much-info-2003.
  • Each person generates 800 Mbytes / year.

19
IR and the Web
  • In 2002: 5 exabytes of new information
  • Magnetic media (HDs): 92%
  • Films: 7%
  • Print material: 0.01%
  • Optical media: 0.002%
  • 5 exabytes = 5 million terabytes =
    5,000,000,000,000,000,000 bytes
  • Twice the amount of 1999, given a growth
    rate of 30% per year.
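
The unit conversion and the growth claim on this slide can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes decimal (SI) units and annual compounding at 30% over the three years from 1999 to 2002, neither of which is stated explicitly on the slide.

```python
EXABYTE = 10**18   # bytes, decimal (SI) definition (assumed)
TERABYTE = 10**12  # bytes, decimal (SI) definition (assumed)

new_info_2002 = 5 * EXABYTE
in_terabytes = new_info_2002 // TERABYTE  # 5 million terabytes

growth_rate = 0.30                           # 30% per year
years = 2002 - 1999                          # three years of growth
growth_factor = (1 + growth_rate) ** years   # 1.3 ** 3, roughly 2.2
```

Under these assumptions, three years of 30% growth multiply the volume by about 2.2, which is consistent with the slide's statement that the 2002 figure is roughly twice that of 1999.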

20
IR and the Web
  • Information flow: radio, TV, Internet
  • 18 exabytes of new information in 2002
  • 3.5 times the amount stored
  • Telephone lines (and cell phones): 98%
  • 320 million hours of radio and TV transmissions,
    with 70 million new hours, with 81 gigabytes of
    text.

21
IR and the Web
  • Email:
  • 31 billion e-mails / year: 400,000 Tbytes of
    new information
  • The Internet (Web):
  • 170 Tbytes of information: 17 times the printed
    content of the US Library of Congress.

22
IR and the Web
  • Search sites:
  • Yahoo, Google, etc.: the first point of
    access for users
  • A typical Internet user: 11 h 20 min / month
  • Access to the desired information: 1/3 of that
    time
  • The user must verify whether the information
    received is what was wanted, and it is often
    impossible to find the information needed.

23
IR and the Web
  • Information glut, or information overload, is the
    main challenge to be overcome by automatic text
    processing systems.

24
Web Search
  • Application of IR to HTML documents on the World
    Wide Web.
  • Differences
  • Must assemble document corpus by spidering the
    web.
  • Can exploit the structural layout information in
    HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.
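
Two of the differences above, assembling the corpus by spidering and exploiting link structure, can be sketched with Python's standard library. To keep the example self-contained it crawls a hypothetical in-memory "web" (the `PAGES` dictionary and its example.org URLs are made up) instead of fetching real pages, and it uses a plain in-link count as a very simple stand-in for link analysis.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Parse one HTML page and return its outgoing links as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical in-memory "web": URL -> HTML content.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}


def spider(seed, pages=PAGES):
    """Breadth-first traversal from seed; returns {url: [outgoing links]}."""
    graph, frontier = {}, [seed]
    while frontier:
        url = frontier.pop(0)
        if url in graph or url not in pages:
            continue  # already visited, or not available in the toy web
        graph[url] = extract_links(url, pages[url])
        frontier.extend(graph[url])
    return graph


graph = spider("http://example.org/")
# Count how many links point at each page.
in_links = Counter(target for links in graph.values() for target in links)
```

In this toy graph, b.html is linked from two pages and so has the highest in-link count. A real spider would also need to fetch pages over the network, respect robots.txt, and revisit pages, since documents on the web change uncontrollably.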

25
Web Search System
[Figure: web search system, including the IR System]
26
Other IR-Related Tasks
  • Automated document categorization
  • Information filtering (spam filtering)
  • Information routing
  • Automated document clustering
  • Recommending information or products
  • Information extraction
  • Information integration
  • Question answering

27
History of IR
  • 1960-70s
  • Initial exploration of text retrieval systems
    for small corpora of scientific abstracts, and
    law and business documents.
  • Development of the basic Boolean and vector-space
    models of retrieval.
  • Prof. Salton and his students at Cornell
    University are the leading researchers in the
    area.

28
IR History Continued
  • 1980s
  • Large document database systems, many run by
    companies
  • Lexis-Nexis
  • Dialog
  • MEDLINE

29
IR History Continued
  • 1990s
  • Searching FTPable documents on the Internet
  • Archie
  • WAIS
  • Searching the World Wide Web
  • Lycos
  • Yahoo
  • Altavista

30
IR History Continued
  • 1990s continued
  • Organized Competitions
  • NIST TREC
  • Recommender Systems
  • Ringo
  • Amazon
  • NetPerceptions
  • Automated Text Categorization and Clustering

31
Recent IR History
  • 2000s
  • Link analysis for Web Search
  • Google
  • Automated Information Extraction
  • Whizbang
  • Fetch
  • Burning Glass
  • Question Answering
  • TREC Q/A track

32
Recent IR History
  • 2000s continued
  • Multimedia IR
  • Image
  • Video
  • Audio and music
  • Cross-Language IR
  • DARPA Tides
  • Document Summarization

33
Related Areas
  • Database Management
  • Library and Information Science
  • Artificial Intelligence
  • Natural Language Processing
  • Machine Learning

34
Database Management
  • Focused on structured data stored in relational
    tables rather than free-form text.
  • Focused on efficient processing of well-defined
    queries in a formal language (SQL).
  • Clearer semantics for both data and queries.
  • Recent move towards semi-structured data (XML)
    brings it closer to IR.

35
Library and Information Science
  • Focused on the human user aspects of information
    retrieval (human-computer interaction, user
    interface, visualization).
  • Concerned with effective categorization of human
    knowledge.
  • Concerned with citation analysis and
    bibliometrics (structure of information).
  • Recent work on digital libraries brings it closer
    to CS IR.

36
Artificial Intelligence
  • Focused on the representation of knowledge,
    reasoning, and intelligent action.
  • Formalisms for representing knowledge and
    queries
  • First-order Predicate Logic
  • Bayesian Networks
  • Others
  • Recent work on web ontologies and intelligent
    information agents brings it closer to IR.

37
Natural Language Processing
  • Focused on the syntactic, semantic, and pragmatic
    analysis of natural language text and discourse.
  • Ability to analyze syntax (phrase structure) and
    semantics could allow retrieval based on meaning
    rather than keywords.

38
Natural Language Processing: IR Directions
  • Methods for determining the sense of an ambiguous
    word based on context (word sense
    disambiguation).
  • Methods for identifying specific pieces of
    information in a document (information
    extraction).
  • Methods for answering specific NL questions from
    document corpora.
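
As an illustration of word sense disambiguation based on context, the sketch below picks the sense whose signature words overlap most with the surrounding words. This is a simplified overlap heuristic in the spirit of the Lesk algorithm, not a method taken from the slides; the `SENSES` table is hand-written and hypothetical, and it reuses the "bat" example from the "Problems with Keywords" slide.

```python
# Hypothetical, hand-written signature words for each sense of a word.
SENSES = {
    "bat": {
        "baseball": {"ball", "hit", "pitcher", "swing", "game"},
        "mammal": {"wings", "cave", "fly", "nocturnal", "animal"},
    },
}


def disambiguate(word, context, senses=SENSES):
    """Pick the sense whose signature words overlap the context the most."""
    context_words = set(context.lower().split())
    candidates = senses[word]
    return max(candidates, key=lambda sense: len(candidates[sense] & context_words))


sense_1 = disambiguate("bat", "the pitcher watched him swing the bat")
sense_2 = disambiguate("bat", "a bat can fly out of the cave at night")
```

When no signature word appears in the context, this sketch falls back to the first listed sense; a practical system would need a better default and far richer sense descriptions.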

39
Machine Learning
  • Focused on the development of computational
    systems that improve their performance with
    experience.
  • Automated classification of examples based on
    learning concepts from labeled training examples
    (supervised learning).
  • Automated methods for clustering unlabeled
    examples into meaningful groups (unsupervised
    learning).

40
Machine Learning: IR Directions
  • Text Categorization
  • Automatic hierarchical classification (Yahoo).
  • Adaptive filtering/routing/recommending.
  • Automated spam filtering.
  • Text Clustering
  • Clustering of IR query results.
  • Automatic formation of hierarchies (Yahoo).
  • Learning for Information Extraction
  • Text Mining
  • Text Summarization
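
Text categorization by supervised learning, such as the spam filtering mentioned above, can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is chosen here only as a common, compact example and is not named in the slides; the training data, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Learn counts from labeled examples, given as (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(TRAINING)
prediction_1 = classify("cheap money", model)
prediction_2 = classify("monday meeting", model)
```

The classifier improves only through the labeled examples it is given, which is the sense of "learning concepts from labeled training examples" on the Machine Learning slide; with four training messages it is a toy, not a usable filter.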