Information Retrieval Systems Info624 Week 1

1
Information Retrieval Systems Info624 Week 1
  • Dr. Min Song
  • College of Information Science and Technology
  • Drexel University

2
Self-Introduction
  • My Journey in America
  • Atlanta, GA
  • Denton, TX
  • College Park, MD
  • San Jose, CA
  • White Plains, NY
  • Lexington, KY
  • Philadelphia, PA

3
Have you ever asked
  • How could search engines find the information I
    request so quickly, out of millions and millions
    of web pages?
  • Which statement do you like the best?
  • It is easy to find just about anything on the
    Web.
  • It's impossible to find anything on the Web; I
    always find so many things that I don't want.
  • How about these?
  • I like search engines very much.
  • I hate search engines!

4
More questions
  • What kinds of questions are easy (or difficult) to
    answer using the Web?
  • Why?
  • Are there any ways to make it easier?
  • What solutions are we looking for?
  • Technical solution?
  • Cognitive solution?

5
Is Google the solution?
  • Everyone likes Google.
  • True or False?
  • What would happen if Google disappeared from the
    Web tomorrow morning?
  • Better Search Results than Google?
  • CNN report, Jan. 5, 2004
  • vivisimo.com/
  • Grokker2
  • TouchGraph

6
How to defeat Google?
  • Microsoft Way
  • I will buy you,
  • Or I will netscape you!
  • Open Source Way
  • Watch Nutch
  • Under the leadership of Doug Cutting
  • The Linux of search engines

7
Course Overview
  • What this course is about
  • How people search and find information.
  • How computers store and retrieve information.
  • How computer systems are designed to help people
    find information they need.

8
Course Overview
  • The course will emphasize
  • An understanding of
  • Theories
  • Tools
  • Algorithms, and
  • Evaluations
  • for Information Retrieval Systems.

9
Course Overview
  • What this course is NOT ...
  • An algorithm design course
  • We may use several related algorithms, but we
    will not study them in detail
  • Our textbook could be used for such a course
  • A system development course
  • However, some assignments may require you to
    compile some C procedures.
  • We look at an IR system as a whole, not as
    individual components

10
Required skills
  • Know how to create HTML pages
  • Have access to a Web server
  • If you don't, the best way is to apply for a dunx1
    account from Drexel.
  • Make sure you request
  • Web server access.
  • Shell access.
  • Have access to a C compiler
  • Having dunx1 shell access is sufficient.

11
Project Idea 1
  • Install and implement an IR system
  • Index a sample document collection or a Web site
  • Test and evaluate all the functionalities of the
    system.
  • Compare this IR system with others.
  • Demonstrate the implementation in class.

12
Project Idea 2
  • Conduct an evaluation experiment on one or two
    selected IR systems
  • Identify the systems
  • Install the systems, if necessary
  • Design the experimental methods
  • Test the experimental methods
  • Analyze the data and write the final report.

13
Project Idea 3
  • Customize an IR system
  • Using open source retrieval software
  • Apache Lucene
  • Implementing a crawler
  • With some open source code
  • Designing a new retrieval interface

14
What is IR?
  • IR is a branch of applied computer science
    focusing on the representation, storage,
    organization, access, and distribution of
    information.
  • IR involves helping users find information that
    matches their information needs.

15
IR Systems
  • IR systems contain three components
  • System
  • People
  • Documents (information items)

16
The Web Brings IR to Center Stage
  • IR has become a center of focus in the Web era.
    Its theories, techniques, and applications have
    reached many fields where processing large
    amounts of information is essential.

17
Challenges of IR

18
Examples
  • Where can I find information needed for my term
    project?
  • Challenges
  • How do you translate the question into a query?
  • What information needs to be stored in the system
    in order to answer the question?
  • Which system will match the request best?

19
Examples
  • Which IST course is most useful?
  • Challenges
  • Information may not exist anywhere
  • It's a personal opinion.
  • Where is bin Laden now?
  • Challenges
  • Intelligence Analysis
  • Needs first-hand information

20
Abstraction Principles
  • First Abstraction Principle
  • Abstract data from the real world
  • And make them available to the system.
  • Second Abstraction Principle
  • Abstract the user's information needs into a form
    the system understands.
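
The two principles can be sketched in a few lines of Python. This is only an illustration, not the course's method: the `abstract` function, the stopword list, and the sample document title are all hypothetical, and the query is the term-project question used as an example in these slides.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "is", "to",
             "my", "can", "i", "where"}

def abstract(text):
    """Reduce free text to a set of lowercase index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

# First principle: abstract a real-world document for the system.
doc_terms = abstract("A Guide to Finding Sources for the Term Project")

# Second principle: abstract the user's information need into a query.
query_terms = abstract(
    "Where can I find information needed for my term project?")

# Both sides now share one representation, so they can be compared.
shared = doc_terms & query_terms
print(sorted(shared))
```

Note that "finding" in the document and "find" in the query do not match under this simple abstraction, which hints at why choosing the representation is a challenge in itself.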

21
Users
  • The user
  • Anyone who needs to find some information
  • The user groups
  • group by their knowledge of the system
  • novice users vs. experienced users
  • end users vs. information specialists
  • group by their domain knowledge
  • Domain experts vs. general public
  • group by information needs
  • need to locate a particular item
  • need some information
  • need all information on a subject

22
Users' Information Needs
  • People depend on information to carry out their
    daily activities.
  • need to accomplish some goals.
  • need to solve some problems.
  • People realize a lack of information
  • perceive a gap in their knowledge state
  • ASK -- Anomalous State of Knowledge
  • desire to fill the gap

23
Users' Information Needs
  • [Figure; content not recoverable from the transcript]

24
  • [Figure; content not recoverable from the transcript]

25
Data and Information
  • Data
  • String of symbols associated with objects,
    people, and events
  • Values of an attribute
  • Data need not have meaning to everyone
  • Data must be interpreted with associated
    attributes.

26
Data and Information
  • Information
  • The meaning of the data interpreted by a person
    or a system
  • Data that changes the state of a person or system
    that perceives it.
  • Data that reduces uncertainty.
  • If the data contain no uncertainty, there is no
    information in the data.
  • Examples: It snows in the winter.
  • It does not snow this winter.
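
One common way to make "data that reduces uncertainty" precise is Shannon's self-information. The slide does not name this measure, so reading it this way is an assumption. With p(x) the probability the receiver assigns to message x beforehand:

```latex
I(x) = -\log_2 p(x) \quad \text{(bits)}
```

A message that was certain in advance, p(x) = 1, gives I(x) = 0. With purely hypothetical probabilities, "It snows in the winter" at p = 0.99 carries about 0.014 bits, while "It does not snow this winter" at p = 0.01 carries about 6.64 bits.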

27
Information and Knowledge
  • Knowledge
  • Structured information
  • through structuring, information becomes
    understandable
  • Processed Information
  • through processing, information becomes
    meaningful and useful
  • information shared and agreed upon within a
    community

28
Text
  • Strings of ASCII or Unicode symbols
  • structured by the author
  • indexed by information service providers
  • Representation of natural languages people use
  • To convey meanings
  • To communicate between readers and authors.
  • Data or information?
  • If it can be understood, it's information.
  • By whom? A person or a system?

29
Documents
  • Logical unit of text
  • articles, books,
  • links, web pages
  • Other components that come with the text
  • figures, charts, graphics
  • multimedia

30
Textual Data
  • Repository of human intellect
  • Rich and diverse resources for all answers.
  • If it is written, it is there (in text)
  • Meaningful and understandable (to users).
  • Simple ASCII representation
  • Free of pre-formatted structures
  • continuous
  • separated into documents
  • Easy to process by the computer
  • Machine Intensive (not labor intensive)

31
Problems with Text
  • Massive
  • Any IR system needs the capability of large scale
    data processing.
  • The use of indexes and various representations is
    required.
  • Inconsistent
  • It's a human language
  • Syntactic and semantic variance
  • Same information expressed in different ways.
  • Different information expressed in similar ways.
  • Incomplete
  • It uses common knowledge.
  • It's an open system.

32
Retrieval
  • Retrieval
  • What do we retrieve?
  • Data
  • Information
  • Knowledge
  • We retrieve documents that contain text, which
    carries information.
  • Information can be anywhere
  • in the text, in the links, in the process of text.

33
Information Retrieval
  • Are they the same?
  • Text retrieval
  • Document retrieval
  • Information retrieval

34
Information Retrieval
  • Conceptually, information retrieval is used to
    cover all related problems in finding needed
    information
  • Historically, information retrieval is about
    document retrieval, emphasizing the document as
    the basic unit
  • Technically, information retrieval refers to
    (text) string manipulation, indexing, matching,
    querying, etc.
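
The technical view (string manipulation, indexing, matching, querying) can be sketched with a small inverted index. This is a minimal sketch under stated assumptions: whitespace tokenization, Boolean AND matching, and three made-up documents. The names `build_index` and `search` are hypothetical and do not come from any particular IR system.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # String manipulation: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Querying: return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Matching: intersect the posting sets of all query terms.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up sample collection for illustration.
docs = {
    1: "information retrieval systems",
    2: "database systems and data retrieval",
    3: "information needs of users",
}
index = build_index(docs)
print(sorted(search(index, "information retrieval")))  # [1]
print(sorted(search(index, "systems")))                # [1, 2]
```

The index is built once and each query only touches the posting sets of its own terms, which is the usual reason for indexing a collection instead of scanning every document per query.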

35
Summary
  • The goal of IR systems is to help users find
    information that satisfies their information
    needs.
  • The main process of IR systems is to match data
    abstracted from the real world to queries
    abstracted from users' information needs.
  • Information retrieval is much more difficult than
    data retrieval.

36
Data Retrieval vs. Information Retrieval
                       Data retrieval     Information retrieval
  Content              Data               Information
  Data object          Table              Document
  Matching             Exact match        Partial match, best match
  Items wanted         Matching           Relevant
  Query language       SQL (artificial)   Natural
  Query specification  Complete           Incomplete
  Model                Deterministic      Probabilistic
  Structure            Highly structured  Less structured
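
The "Matching" row can be illustrated with a short sketch, assuming a simple term-overlap score as the best-match criterion. Real IR systems use more refined ranking models. The function names, records, and documents below are hypothetical.

```python
def exact_match(records, field, value):
    """Data retrieval: return records whose field equals the value exactly."""
    return [r for r in records if r[field] == value]

def best_match(docs, query):
    """Information retrieval: rank documents by shared query terms."""
    q = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Made-up structured records and free-text documents.
records = [{"course": "INFO624", "week": 1},
           {"course": "INFO624", "week": 2}]
docs = {
    "d1": "search engines index web pages",
    "d2": "how web search engines rank pages",
    "d3": "cooking with seasonal vegetables",
}

week_one = exact_match(records, "week", 1)
ranking = best_match(docs, "how do search engines work")
print(week_one)
print(ranking)
```

The exact match either returns a record or does not, while the best match returns partially matching documents in ranked order, even though no document contains every query term.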