Title: Information Retrieval
1Information Retrieval
January 21, 2005
2Course Information
- Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: M 11-12, Th 12-1, or via email
- Course page: http://tangra.si.umich.edu/radev/650/
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3Compression
4Compression
- Methods
- Fixed length codes
- Huffman coding
- Ziv-Lempel codes
5Fixed length codes
- Binary representations
- ASCII
- Representational power (2^k symbols, where k is the number of bits)
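The relation between alphabet size and code width can be sketched in a few lines. This is only an illustration; the function names `bits_needed` and `fixed_length_code` are hypothetical and do not come from the slides.

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols."""
    if num_symbols < 1:
        raise ValueError("need at least one symbol")
    return (num_symbols - 1).bit_length()


def fixed_length_code(symbols):
    """Assign every symbol a binary codeword of the same width."""
    k = max(1, bits_needed(len(symbols)))  # use at least one bit per symbol
    return {s: format(i, f"0{k}b") for i, s in enumerate(symbols)}


print(bits_needed(26))           # 26 letters fit in 5 bits
print(bits_needed(128))          # 128 ASCII characters fit in 7 bits
print(fixed_length_code("abc"))  # every codeword is 2 bits wide
```

Every symbol costs the same number of bits regardless of how often it occurs, which is the inefficiency that variable length codes address.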
6Variable length codes
- Alphabet
- A .-     N -.     0 -----
- B -...   O ---    1 .----
- C -.-.   P .--.   2 ..---
- D -..    Q --.-   3 ...--
- E .      R .-.    4 ....-
- F ..-.   S ...    5 .....
- G --.    T -      6 -....
- H ....   U ..-    7 --...
- I ..     V ...-   8 ---..
- J .---   W .--    9 ----.
- K -.-    X -..-
- L .-..   Y -.--
- M --     Z --..
- Demo
- http://www.babbage.demon.co.uk/morse.html
- http://www.scphillips.com/morse/
7Most frequent letters in English
- Most frequent letters
- E T A O I N S H R D L U
- http://www.math.cornell.edu/mec/modules/cryptography/subs/frequencies.html
- Demo
- http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
- Also bigrams
- TH HE IN ER AN RE ND AT ON NT
- http://www.math.cornell.edu/mec/modules/cryptography/subs/digraphs.html
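Rankings like the ones above come from counting over a text sample. The following is a minimal sketch of such a count, not the procedure used by the linked pages; the function names are hypothetical, and counting bigrams only within words is an assumption.

```python
from collections import Counter


def letter_counts(text):
    """Count the letters A-Z in text, ignoring case and other characters."""
    return Counter(c for c in text.upper() if "A" <= c <= "Z")


def bigram_counts(text):
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.upper().split():
        letters = "".join(c for c in word if "A" <= c <= "Z")
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return counts


sample = "the then"
print(letter_counts(sample).most_common(3))
print(bigram_counts(sample).most_common(2))
```

On a large English corpus, the `most_common` output should approach the orderings listed on the slide.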
8Useful links about cryptography
- http://world.std.com/franl/crypto.html
- http://www.faqs.org/faqs/cryptography-faq/
- http://en.wikipedia.org/wiki/Cryptography
9Huffman coding
- Developed by David Huffman (1952)
- Average of 5 bits per character (37.5% compression)
- Based on frequency distributions of symbols
- Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols
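The tree-building step can be sketched with a priority queue. This is one common way to implement it, assuming symbols are strings and ties are broken by insertion order; the name `huffman_code` is hypothetical.

```python
import heapq


def huffman_code(freqs):
    """Build a prefix code from a {symbol: frequency} mapping."""
    if not freqs:
        raise ValueError("need at least one symbol")
    # Heap entries are (weight, tiebreak id, tree); a tree is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees into one node.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # left branch is labeled 0
            walk(tree[1], prefix + "1")  # right branch is labeled 1
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


print(huffman_code({"a": 4, "b": 2, "c": 1}))
```

With these toy frequencies the most frequent symbol receives a 1-bit codeword and the two rarer symbols receive 2-bit codewords.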
11Huffman tree example (figure): binary tree with branches labeled 0 and 1 and leaves a, b, c, d, e, f, g, h, i, j
13Exercise
- Consider the bit string 01101101111000100110001110100111000110101101011101
- Use the Huffman code from the example to decode it.
- Try inserting, deleting, and switching some bits at random locations and try decoding.
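A small decoder makes it easy to experiment with bit errors. The code table below is a hypothetical four-symbol prefix code chosen for illustration, not the code from the example tree; the name `decode` is likewise an assumption.

```python
# Hypothetical prefix code (NOT the one from the example slide).
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def decode(bits, code):
    """Decode a bit string; return (decoded text, undecoded leftover bits)."""
    inverse = {word: sym for sym, word in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:  # a complete codeword has been read
            out.append(inverse[buf])
            buf = ""
    return "".join(out), buf


print(decode("010110111", CODE))  # ('abcd', '')
print(decode("110110111", CODE))  # first bit switched: ('ccd', '')
print(decode("10110111", CODE))   # first bit deleted:  ('bcd', '')
```

A single changed or missing bit can shift codeword boundaries, so the damage may spread beyond the symbol where the error occurred before the decoder resynchronizes.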
14Extensions
- Word-based
- Domain/genre dependent models
15Ziv-Lempel coding
- Two types: one is known as LZ77 (used in GZIP)
- Code: set of triples (a, b, c)
- a: how far back in the decoded text to look for the upcoming text segment
- b: how many characters to copy
- c: new character to add to complete segment
16
- p
- pe
- pet
- peter
- peter_
- peter_pi
- peter_piper
- peter_piper_pic
- peter_piper_pick
- peter_piper_picked
- peter_piper_picked_a
- peter_piper_picked_a_pe
- peter_piper_picked_a_peck_
- peter_piper_picked_a_peck_o
- peter_piper_picked_a_peck_of
- peter_piper_picked_a_peck_of_pickl
- peter_piper_picked_a_peck_of_pickled
- peter_piper_picked_a_peck_of_pickled_pep
- peter_piper_picked_a_peck_of_pickled_pepper
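The decoding steps listed above can be reproduced with a short decoder for (a, b, c) triples. The triples below are not taken from the slide; they were reconstructed so that each step yields the corresponding prefix in the list, and the function name is hypothetical.

```python
# Reconstructed (how far back, how many to copy, new character) triples;
# an assumption chosen to reproduce the prefixes listed on the slide.
TRIPLES = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"),
]


def lz77_decode_steps(triples):
    """Return the decoded text after each triple has been applied."""
    out, steps = [], []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])  # copy from already decoded text
        out.append(ch)                  # then add the new character
        steps.append("".join(out))
    return steps


for step in lz77_decode_steps(TRIPLES):
    print(step)
```

Copying one character at a time also handles the case where the copied segment overlaps the text being produced.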
17Links on text compression
- Data compression
- http://www.data-compression.info/
- Calgary corpus
- http://links.uwaterloo.ca/calgary.corpus.html
- Huffman coding
- http://www.compressconsult.com/huffman/
- http://en.wikipedia.org/wiki/Huffman_coding
- LZ
- http://en.wikipedia.org/wiki/LZ77
18Relevance feedback and query expansion
19Relevance feedback
- Problem: the initial query may not be the most appropriate to satisfy a given information need.
- Idea: modify the original query so that it gets closer to the right documents in the vector space
20Relevance feedback
- Automatic
- Manual
- Method: identifying feedback terms
- Q' = a1 Q + a2 R - a3 N (R: relevant documents, N: non-relevant documents)
- Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
21Example
- Q: safety minivans
- D1: car safety minivans tests injury statistics - relevant
- D2: liability tests safety - relevant
- D3: car passengers injury reviews - non-relevant
- R = ?
- S = ?
- Q' = ?
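One way to work through this example is to treat the query and documents as raw term-count vectors and apply the feedback formula with weights 1, 1/|R| and 1/|S|, where S denotes the non-relevant set. This is a sketch under those assumptions (no idf weighting or normalization); the function name `rocchio` is hypothetical.

```python
from collections import Counter, defaultdict


def rocchio(query, relevant, nonrelevant):
    """Return the modified query as a {term: weight} dict."""
    a1 = 1.0
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_query = defaultdict(float)
    for term, tf in Counter(query.split()).items():
        new_query[term] += a1 * tf
    for doc in relevant:      # pull the query toward relevant documents
        for term, tf in Counter(doc.split()).items():
            new_query[term] += a2 * tf
    for doc in nonrelevant:   # push it away from non-relevant documents
        for term, tf in Counter(doc.split()).items():
            new_query[term] -= a3 * tf
    return dict(new_query)


q_new = rocchio(
    "safety minivans",
    ["car safety minivans tests injury statistics", "liability tests safety"],
    ["car passengers injury reviews"],
)
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight}")
```

Under these assumptions "safety" ends up with weight 2.0 and "minivans" with 1.5, while terms that appear mostly in the non-relevant document, such as "passengers" and "reviews", receive negative weights.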
22Pseudo relevance feedback
- Automatic query expansion
- Thesaurus-based expansion (e.g., using latent semantic indexing; covered later)
- Distributional similarity
- Query log mining
23Examples
Lexical semantics (Hypernymy)
- Book: publication, product, fact, dramatic composition, record
- Computer: machine, expert, calculator, reckoner, figurer
- Fruit: reproductive structure, consequence, product, bear
- Politician: leader, schemer
- Newspaper: press, publisher, product, paper, newsprint
Distributional clustering
- Book: autobiography, essay, biography, memoirs, novels
- Computer: adobe, computing, computers, developed, hardware
- Fruit: leafy, canned, fruits, flowers, grapes
- Politician: activist, campaigner, politicians, intellectuals, journalist
- Newspaper: daily, globe, newspapers, newsday, paper
24Examples (query logs)
- Book: booksellers, bookmark, blue
- Computer: sales, notebook, stores, shop
- Fruit: recipes, cake, salad, basket, company
- Games: online, play, gameboy, free, video
- Politician: careers, federal, office, history
- Newspaper: online, website, college, information
- Schools: elementary, high, ranked, yearbook
- California: berkeley, san francisco, southern
- French: embassy, dictionary, learn
25Problems with automatic query expansion
- Adding frequent words may dilute the results (example?)