Information Retrieval - PowerPoint PPT Presentation

(C) 2003, The University of Michigan


1
Information Retrieval
January 21, 2005
  • Handout 3

2
Course Information
  • Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
  • Office: 3080, West Hall Connector
  • Phone: (734) 615-5225
  • Office hours: M 11-12, Th 12-1, or via email
  • Course page: http://tangra.si.umich.edu/radev/650/
  • Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

3
Compression
4
Compression
  • Methods
  • Fixed length codes
  • Huffman coding
  • Ziv-Lempel codes

5
Fixed length codes
  • Binary representations
  • ASCII
  • Representational power (2^k symbols, where k is the
    number of bits)
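
A minimal sketch of the 2^k relationship, assuming symbols are simply numbered in order. The helper names `bits_needed` and `fixed_length_code` are illustrative and do not come from the slides.

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols (at least 1 bit)."""
    if num_symbols < 1:
        raise ValueError("need at least one symbol")
    return max(1, (num_symbols - 1).bit_length())


def fixed_length_code(symbols):
    """Assign every symbol a codeword of the same length k."""
    k = bits_needed(len(symbols))
    return {s: format(i, f"0{k}b") for i, s in enumerate(symbols)}


codes = fixed_length_code("abcdefghij")   # 10 symbols need 4 bits each
print(bits_needed(26), bits_needed(128), bits_needed(256))  # 5 7 8
print(codes["a"], codes["j"])                                # 0000 1001
```

Every codeword costs the same number of bits regardless of how often its symbol occurs, which is the inefficiency that variable length codes address.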

6
Variable length codes
  • Alphabet (Morse code)
  • A .-     N -.     0 -----
  • B -...   O ---    1 .----
  • C -.-.   P .--.   2 ..---
  • D -..    Q --.-   3 ...--
  • E .      R .-.    4 ....-
  • F ..-.   S ...    5 .....
  • G --.    T -      6 -....
  • H ....   U ..-    7 --...
  • I ..     V ...-   8 ---..
  • J .---   W .--    9 ----.
  • K -.-    X -..-
  • L .-..   Y -.--
  • M --     Z --..
  • Demo
  • http://www.babbage.demon.co.uk/morse.html
  • http://www.scphillips.com/morse/

7
Most frequent letters in English
  • Most frequent letters
  • E T A O I N S H R D L U
  • http://www.math.cornell.edu/mec/modules/cryptography/subs/frequencies.html
  • Demo
  • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
  • Also bigrams
  • TH HE IN ER AN RE ND AT ON NT
  • http://www.math.cornell.edu/mec/modules/cryptography/subs/digraphs.html
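
Letter and bigram rankings like the ones above can be estimated from any text sample. The sketch below is one way to count them; it assumes that only the letters A-Z matter and that bigrams do not cross word boundaries, and the function names are illustrative.

```python
from collections import Counter


def letter_counts(text):
    """Count occurrences of each letter A-Z, ignoring case and other characters."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")


def bigram_counts(text):
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.upper().split():
        letters = "".join(ch for ch in word if "A" <= ch <= "Z")
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return counts


sample = "the weather in the north is rather nice there"
print(letter_counts(sample).most_common(3))
print(bigram_counts(sample).most_common(3))
```

A short sample like this one will not reproduce the E T A O I N ordering exactly; the published rankings come from much larger corpora.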

8
Useful links about cryptography
  • http://world.std.com/franl/crypto.html
  • http://www.faqs.org/faqs/cryptography-faq/
  • http://en.wikipedia.org/wiki/Cryptography

9
Huffman coding
  • Developed by David Huffman (1952)
  • Average of 5 bits per character (37.5%
    compression relative to 8-bit codes)
  • Based on frequency distributions of symbols
  • Algorithm: iteratively build a tree of symbols,
    starting with the two least frequent symbols
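
The tree-building step can be sketched with a priority queue: repeatedly remove the two lightest nodes and merge them. This is a minimal illustration, not the course's reference implementation. It assumes symbols are strings, and the frequencies in the usage example are made up for illustration.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bit string} for a {symbol: frequency} mapping.

    Assumes symbols are strings; tuples are reserved for internal tree nodes.
    """
    tiebreak = count()  # unique counter so equal weights never compare nodes
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one subtree.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")   # left branch
            walk(node[1], prefix + "1")   # right branch
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


freqs = {"a": 5, "b": 2, "c": 1, "d": 1}   # hypothetical counts
codes = huffman_code(freqs)
print(codes)
```

Frequent symbols end up near the root and receive short codewords, while rare symbols sit deeper in the tree. The exact 0/1 labels depend on how ties are broken, but the total encoded length does not.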

10
(No Transcript)

11
[Figure: Huffman code tree with branches labeled 0 and 1; leaf symbols g, i, j, f, c, b, d, a, h, e]

12
(No Transcript)

13
Exercise
  • Consider the bit string
    01101101111000100110001110100111000110101101011101
  • Use the Huffman code from the example to decode
    it.
  • Try inserting, deleting, and switching some bits
    at random locations and try decoding.
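
For experimenting with the exercise, a prefix-code decoder can be sketched as below. The code table `toy` is hypothetical and is not the table from the lecture's tree, so substitute the real codewords before decoding the exercise string.

```python
def decode(bits, codes):
    """Decode a bit string with a prefix code given as {symbol: bit string}.

    Returns the decoded text plus any leftover bits that did not
    complete a codeword.
    """
    lookup = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:      # prefix property: first match is the symbol
            out.append(lookup[buffer])
            buffer = ""
    return "".join(out), buffer


toy = {"a": "0", "b": "10", "c": "110", "d": "111"}   # hypothetical code
message = "0101100111"                                 # a b c a d
print(decode(message, toy))                # ('abcad', '')
print(decode("1" + message[1:], toy))      # first bit switched: ('ccad', '')
print(decode(message[1:], toy))            # first bit deleted:  ('bcad', '')
```

In this toy case the decoder falls back into step after a few symbols, but a single damaged bit still changes both the content and the number of decoded symbols.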

14
Extensions
  • Word-based
  • Domain/genre dependent models

15
Ziv-Lempel coding
  • Two types - one is known as LZ77 (used in GZIP)
  • Code: set of triples (a, b, c)
  • a: how far back in the decoded text to look for
    the upcoming text segment
  • b: how many characters to copy
  • c: new character to add to complete segment

16
  • p
  • pe
  • pet
  • peter
  • peter_
  • peter_pi
  • peter_piper
  • peter_piper_pic
  • peter_piper_pick
  • peter_piper_picked
  • peter_piper_picked_a
  • peter_piper_picked_a_pe
  • peter_piper_picked_a_peck_
  • peter_piper_picked_a_peck_o
  • peter_piper_picked_a_peck_of
  • peter_piper_picked_a_peck_of_pickl
  • peter_piper_picked_a_peck_of_pickled
  • peter_piper_picked_a_peck_of_pickled_pep
  • peter_piper_picked_a_peck_of_pickled_pepper
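
The growing prefixes above are what an LZ77 decoder produces one triple at a time. The sketch below decodes (a, b, c) triples as defined on the previous slide; the triples themselves are a reconstruction chosen to reproduce the first eight lines and are an assumption, not taken from the slides.

```python
def lz77_prefixes(triples):
    """Yield the decoded text after each (back, length, char) triple."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            # Copy one character at a time so a copy may overlap its own output.
            out.append(out[start + i])
        out.append(char)
        yield "".join(out)


# Reconstructed triples (assumption) for the start of the example text.
triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
    (0, 0, "_"), (6, 1, "i"), (8, 2, "r"), (6, 3, "c"),
]
for prefix in lz77_prefixes(triples):
    print(prefix)
# Last line printed: peter_piper_pic
```

Each triple adds b copied characters plus one new character, which is why the decoded text grows in uneven steps.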

17
Links on text compression
  • Data compression
  • http://www.data-compression.info/
  • Calgary corpus
  • http://links.uwaterloo.ca/calgary.corpus.html
  • Huffman coding
  • http://www.compressconsult.com/huffman/
  • http://en.wikipedia.org/wiki/Huffman_coding
  • LZ
  • http://en.wikipedia.org/wiki/LZ77

18
Relevance feedback and query expansion
19
Relevance feedback
  • Problem: the initial query may not be the most
    appropriate to satisfy a given information need.
  • Idea: modify the original query so that it gets
    closer to the right documents in the vector space

20
Relevance feedback
  • Automatic
  • Manual
  • Method: identifying feedback terms
  • Q' = a1 Q + a2 R - a3 N
  • Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|

21
Example
  • Q = safety minivans
  • D1 = car safety minivans tests injury
    statistics - relevant
  • D2 = liability tests safety - relevant
  • D3 = car passengers injury reviews -
    non-relevant
  • R = ?
  • S = ?
  • Q' = ?
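
One way to work through this example is to apply the update Q' = a1 Q + a2 R - a3 N with a1 = 1, a2 = 1/|R| and a3 = 1/|N|, where R and N stand for the summed relevant and non-relevant document vectors. The sketch below assumes raw term counts as weights, which the slides do not specify, and the function names are illustrative.

```python
from collections import Counter


def vectorize(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(text.split())


def feedback_query(query, relevant, nonrelevant, a1=1.0):
    """Move the query toward relevant and away from non-relevant documents."""
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_query = {}
    for term, w in query.items():
        new_query[term] = new_query.get(term, 0.0) + a1 * w
    for doc in relevant:
        for term, w in doc.items():
            new_query[term] = new_query.get(term, 0.0) + a2 * w
    for doc in nonrelevant:
        for term, w in doc.items():
            new_query[term] = new_query.get(term, 0.0) - a3 * w
    return new_query


q = vectorize("safety minivans")
d1 = vectorize("car safety minivans tests injury statistics")
d2 = vectorize("liability tests safety")
d3 = vectorize("car passengers injury reviews")

q_new = feedback_query(q, relevant=[d1, d2], nonrelevant=[d3])
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" ends with weight 2.0 and "minivans" with 1.5, while terms that occur mainly in the non-relevant document, such as "passengers" and "reviews", become negative. Negative weights are often clipped to zero in practice.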

22
Pseudo relevance feedback
  • Automatic query expansion
  • Thesaurus-based expansion (e.g., using latent
    semantic indexing, covered later)
  • Distributional similarity
  • Query log mining
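
Pseudo relevance feedback can be sketched as follows: treat the top-ranked documents for the original query as if they were relevant and add their most frequent new terms to the query. The ranking by term overlap, the parameter names, and the cutoffs are all simplifying assumptions made for illustration.

```python
from collections import Counter


def pseudo_relevance_expand(query, docs, top_k=2, num_terms=1):
    """Expand a query with frequent terms from its top_k best-matching docs."""
    q_terms = query.split()
    q_set = set(q_terms)
    # Rank documents by how many query terms they contain (assumed scoring).
    ranked = sorted(docs, key=lambda d: len(q_set & set(d.split())), reverse=True)
    counts = Counter()
    for doc in ranked[:top_k]:
        counts.update(t for t in doc.split() if t not in q_set)
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return q_terms + expansion


docs = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
print(pseudo_relevance_expand("safety minivans", docs))
# ['safety', 'minivans', 'tests']
```

No user judgments are involved, so a poor initial ranking feeds poor expansion terms back into the query.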

23
Examples
Lexical semantics (Hypernymy)
  • Book: publication, product, fact, dramatic composition, record
  • Computer: machine, expert, calculator, reckoner, figurer
  • Fruit: reproductive structure, consequence, product, bear
  • Politician: leader, schemer
  • Newspaper: press, publisher, product, paper, newsprint
Distributional clustering
  • Book: autobiography, essay, biography, memoirs, novels
  • Computer: adobe, computing, computers, developed, hardware
  • Fruit: leafy, canned, fruits, flowers, grapes
  • Politician: activist, campaigner, politicians, intellectuals, journalist
  • Newspaper: daily, globe, newspapers, newsday, paper

24
Examples (query logs)
  • Book: booksellers, bookmark, blue
  • Computer: sales, notebook, stores, shop
  • Fruit: recipes, cake, salad, basket, company
  • Games: online, play, gameboy, free, video
  • Politician: careers, federal, office, history
  • Newspaper: online, website, college, information
  • Schools: elementary, high, ranked, yearbook
  • California: berkeley, san francisco, southern
  • French: embassy, dictionary, learn

25
Problems with automatic query expansion
  • Adding frequent words may dilute the results
    (example?)