Title: Search Engine Technology
1 Search Engine Technology (4)
http://www.cs.columbia.edu/radev/SET07.html
- September 27, 2007
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2 SET Fall 2007
7. Approximate string matching
3 Levenshtein edit distance
- Examples
- Theatre -> theater
- Ghaddafi -> Qadafi
- Computer -> counter
- Edit distance (inserts, deletes, substitutions)
- Edit transcript
- Done through dynamic programming
4 Recurrence relation
- Three dependencies
- D(i,0) = i
- D(0,j) = j
- D(i,j) = min[D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + t(i,j)]
- Simple edit distance
- t(i,j) = 0 iff S1(i) = S2(j), and 1 otherwise
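The recurrence above can be sketched in a few lines of Python. This is a minimal illustration of the simple (unit-cost) edit distance, not the course's own l.pl tool; the function name edit_distance is an assumption made for this example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Simple (unit-cost) Levenshtein distance via dynamic programming."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j chars of s2
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # base case D(i,0) = i: delete all i characters
    for j in range(m + 1):
        D[0][j] = j  # base case D(0,j) = j: insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # t(i,j) = 0 iff S1(i) = S2(j); strings are 0-indexed in Python
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match or substitution
            )
    return D[n][m]
```

With this sketch, edit_distance("theatre", "theater") returns 2 and edit_distance("computer", "counter") returns 3. The table takes O(nm) time and space.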
5 Example
Gusfield 1997
6 Example (cont'd)
Gusfield 1997
7 Tracebacks
Gusfield 1997
Gusfield 1997
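A traceback can be sketched as follows: fill the table, then walk back from D(n,m) to D(0,0), at each cell following any neighbor that explains the cell's value. The function name edit_transcript and the letter codes (M, R, D, I) are choices made for this illustration; when several optimal transcripts exist, this sketch returns just one of them (it prefers the diagonal).

```python
def edit_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript that turns s1 into s2.

    M = match, R = replace (substitution), D = delete from s1, I = insert from s2.
    """
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    # Trace back from the bottom-right cell to the top-left cell.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # character s1[i-1] is deleted
            i -= 1
        else:
            ops.append("I")  # character s2[j-1] is inserted
            j -= 1
    return "".join(reversed(ops))
```

The number of non-M letters in the returned transcript equals the edit distance.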
8 Weighted edit distance
- Used to emphasize the relative cost of different edit operations
- Useful in bioinformatics
- Homology information
- BLAST
- Blosum
- http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
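A weighted variant can be sketched by replacing the unit costs in the recurrence with per-operation costs. The cost values and the vowel_sub function below are hypothetical, chosen only to illustrate the idea. Note that matrices such as BLOSUM hold similarity scores (higher means more alike), so they are used to maximize an alignment score, whereas this sketch minimizes a distance.

```python
def weighted_edit_distance(s1, s2, ins_cost=1.0, del_cost=1.0, sub_cost=None):
    """Edit distance with configurable insertion, deletion and substitution costs."""
    if sub_cost is None:
        # Default: unit substitution cost, which gives the simple edit distance
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    n, m = len(s1), len(s2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[n][m]


def vowel_sub(a, b):
    """Hypothetical cost function: swapping one vowel for another is cheap."""
    if a == b:
        return 0.0
    if a in "aeiou" and b in "aeiou":
        return 0.5
    return 1.0
```

For example, weighted_edit_distance("bat", "bit", sub_cost=vowel_sub) returns 0.5, while the default costs give 1.0.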
9 Links
- Web sites
- http://www.merriampark.com/ld.htm
- http://odur.let.rug.nl/kleiweg/lev/
- Demo
- /home/cs6998/tools/editDistance/dp/l.pl theater theatre
- http://nayana.ece.ucsb.edu/imsearch/imsearch.html
10 Other methods
- Cosine
- Generation probabilities (language modeling)
- (exp)KL-divergence
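The three measures listed above can be sketched over bag-of-words counts. Several choices here are assumptions made for the illustration, not taken from the slides: raw term counts as vector weights, add-alpha smoothing for the unigram language model, and reading "(exp)KL-divergence" as the similarity exp(-KL).

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unigram_model(counts, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    total = sum(counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}


def log_generation_probability(counts, model):
    """Log-probability that the unigram model generates the given text."""
    return sum(c * math.log(model[t]) for t, c in counts.items())


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def kl_similarity(p, q):
    """Assumed reading of '(exp)KL': map the divergence into (0, 1]."""
    return math.exp(-kl_divergence(p, q))
```

Smoothing matters for the last two functions: without it, a term missing from one text would give a zero probability and an undefined logarithm.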
11 SET Winter 2007
8. Query expansion and relevance feedback
12 Query expansion
13 Query expansion
- Corpus-based mine query logs
- NLP-based
- Vector-space relevance feedback
14 Relevance feedback
- Problem: the initial query may not be the most appropriate to satisfy a given information need.
- Idea: modify the original query so that it gets closer to the right documents in the vector space.
15 Relevance feedback
- Automatic
- Manual
- Method: identifying feedback terms
- Q' = a1Q + a2R - a3N
- Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
16 Example
- Q: safety minivans
- D1: car safety minivans tests injury statistics (relevant)
- D2: liability tests safety (relevant)
- D3: car passengers injury reviews (non-relevant)
- R = ?
- S = ?
- Q' = ?
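One way to work this example is sketched below. It assumes raw term counts as vector weights, takes R and N to be the sums of the relevant and non-relevant document vectors, and uses the common weights a1 = 1, a2 = 1/|R|, a3 = 1/|N|, with |R| and |N| read as the number of documents in each set. The function name rocchio is chosen for this illustration.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Modified query a1*Q + a2*R - a3*N as a term -> weight dictionary."""
    if a2 is None:
        a2 = 1.0 / len(relevant) if relevant else 0.0
    if a3 is None:
        a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    R = Counter()
    for d in relevant:
        R.update(d)  # sum of the relevant document vectors
    N = Counter()
    for d in nonrelevant:
        N.update(d)  # sum of the non-relevant document vectors
    terms = set(query) | set(R) | set(N)
    return {t: a1 * query.get(t, 0) + a2 * R.get(t, 0) - a3 * N.get(t, 0)
            for t in terms}


q = Counter("safety minivans".split())
d1 = Counter("car safety minivans tests injury statistics".split())
d2 = Counter("liability tests safety".split())
d3 = Counter("car passengers injury reviews".split())

new_q = rocchio(q, relevant=[d1, d2], nonrelevant=[d3])
for term, weight in sorted(new_q.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions, safety ends up with weight 2.0, minivans 1.5 and tests 1.0, while passengers and reviews get -1.0. Negative weights are often clipped to zero in practice; this sketch leaves them in place so the effect of the a3N term stays visible.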
17 Pseudo relevance feedback
- Automatic query expansion
- Thesaurus-based expansion (e.g., using latent semantic indexing, discussed later)
- Distributional similarity
- Query log mining
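Pseudo relevance feedback is usually described as running the initial query, assuming the top-ranked documents are relevant, and expanding the query from them without asking the user. The sketch below illustrates that loop only. The overlap score is a stand-in for a real retrieval function, and the names pseudo_relevance_expand, k and m are hypothetical.

```python
from collections import Counter


def overlap_score(query_terms, doc_terms):
    """Stand-in retrieval score: number of distinct query terms in the document."""
    return len(set(query_terms) & set(doc_terms))


def pseudo_relevance_expand(query_terms, docs, k=2, m=2):
    """Expand the query with the m most frequent new terms from the top-k documents."""
    tokenized = [d.split() for d in docs]
    # Rank documents by the stand-in score and assume the top k are relevant
    ranked = sorted(tokenized, key=lambda d: overlap_score(query_terms, d),
                    reverse=True)
    pooled = Counter()
    for d in ranked[:k]:
        pooled.update(d)
    # Keep only terms that are not already in the query
    new_terms = [t for t, _ in pooled.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:m]


docs = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
expanded = pseudo_relevance_expand(["safety", "minivans"], docs, k=2, m=1)
```

With the three documents from the earlier example, the top two documents are the ones that mention safety, and the single added term is tests, the only new term that occurs in both.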
18 Examples
Lexical semantics (Hypernymy)
- Book: publication, product, fact, dramatic composition, record
- Computer: machine, expert, calculator, reckoner, figurer
- Fruit: reproductive structure, consequence, product, bear
- Politician: leader, schemer
- Newspaper: press, publisher, product, paper, newsprint
Distributional clustering
- Book: autobiography, essay, biography, memoirs, novels
- Computer: adobe, computing, computers, developed, hardware
- Fruit: leafy, canned, fruits, flowers, grapes
- Politician: activist, campaigner, politicians, intellectuals, journalist
- Newspaper: daily, globe, newspapers, newsday, paper
19 Examples (query logs)
- Book: booksellers, bookmark, blue
- Computer: sales, notebook, stores, shop
- Fruit: recipes cake salad basket company
- Games: online play gameboy free video
- Politician: careers federal office history
- Newspaper: online website college information
- Schools: elementary high ranked yearbook
- California: berkeley san francisco southern
- French: embassy dictionary learn
20 Otterbacher et al., HLT/EMNLP 2005
22 Readings
- For February 21: MRS15, MRS16
- For February 28: MRS17
- For March 7: MRS18, MRS19