Title: Search Engine Technology 1 http://www.cs.columbia.edu/radev/SET07.html
1Search Engine Technology (1)
http://www.cs.columbia.edu/radev/SET07.html
- September 6, 2007
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2SET Fall 2007
7Examples of search engines
- Conventional (library catalog): search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords. Limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
- Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language.
- Clustering systems (Vivisimo, Clusty)
- Research systems (Lemur, Nutch)
8What does it take to build a search engine?
- Decide what to index
- Collect it
- Index it (efficiently)
- Keep the index up to date
- Provide user-friendly query facilities
9What else?
- Understand the structure of the web for efficient crawling
- Understand user information needs
- Preprocess text and other unstructured data
- Cluster data
- Classify data
- Evaluate performance
10Goals of the course
- Understand how to collect, store, index, analyze, search and present large quantities of unstructured text.
- Understand the dynamics of the Web by building appropriate mathematical models.
- Build working systems that assist users in finding useful information on the Web.
- Understand and use third party software.
11Course logistics
- Thursdays 6-8 PM in 833 Mudd
- Office hour: TBA
- Web site: http://www.cs.columbia.edu/radev/SET07
- Instructor: Dragomir Radev (PhD, Columbia-CS)
- Email: radev_at_umich.edu (please do not send me mail at Columbia)
- TA: TBA
12Course outline
- Classic document retrieval: storing, indexing, retrieval.
- Web retrieval: crawling, query processing.
- Text and web mining: classification, clustering.
- Network analysis: random graph models, centrality, diameter and clustering coefficient.
13Syllabus
- Introduction.
- Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model.
- Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes.
- Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TF-IDF. Vector space similarity and ranking.
- Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
- Automated indexing/labeling. Compression and coding. Optimal codes.
- String matching. Approximate matching.
- Query expansion. Relevance feedback.
14Syllabus
- Text classification. Naive Bayes. Feature selection. Decision trees.
- Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.
- Lexical semantics and WordNet.
- Latent semantic indexing. Singular value decomposition.
- Vector space clustering. k-means clustering. EM clustering.
- Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie model.
- Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
15Syllabus
- Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
- Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.
- Graph-based methods. Harmonic functions. Random walks. PageRank.
- Models of the web. Hubs and authorities. Bipartite graphs. HITS and SALSA.
- Discovering communities. Spectral clustering.
- Semi-supervised passage retrieval. Question answering.
- Probabilistic models of IR. Language models. Burstiness. Self-triggerability.
- Quick overview of topics such as: Collaborative filtering. Recommendation systems. Information extraction. Hidden Markov Models. Conditional Random Fields. User behavior.
- Quick overview of topics such as: Adversarial IR. Spamming and anti-spamming methods. Text summarization.
- Student presentations (between Dec 14 and Dec 21)
- Additional topics (if time permits)
- Natural language processing. XML retrieval. Text tiling. Human behavior on the web.
16Readings
- Required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html), freely available, mirrored on August 14, 2007.
- Optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
- Papers from SIGIR, WWW and journals (to be announced in class).
17Prerequisites
- Linear algebra: vectors, matrices, and operations on them, determinants, eigenvectors.
- Calculus: differentiation, finding extrema of functions.
- Probabilities: random variables, discrete and continuous distributions, Bayes' theorem.
- Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.
- Required: CS account (check CS web site)
18Course requirements
- Three (mostly programming) assignments (40%)
- Some of them will be in Perl. The rest can be done in any appropriate language.
- Reading assignments (10%)
- Final project (40%)
- Students will present their final project in a poster session in class.
- Class participation (10%)
- No final exam.
19Final project format
- Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.
- Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand-alone or an extension to an existing one.
20Project ideas
- Build a question answering system.
- Build a language identification system.
- Social network analysis from the Web.
- Participate in the Netflix challenge.
- Query log analysis.
- Build models of Web evolution.
- Information diffusion in blogs or web.
- Author-topic models of web pages.
- Using the web for machine translation.
- Building evolving models of web documents.
- News recommendation system.
- Compress the text of Wikipedia (losslessly).
- Spelling correction using query logs.
- Automatic query expansion.
21List of projects from last year
- Document Closures for Indexing
- Tibet - Table Structure Recognition Library
- Ruby Blog Memetracker
- Sentence decomposition for more accurate information retrieval
- Extracting Social Networks from LiveJournal
- Google Suggest Programming Project (Java Swing Client and Lucene Back-End)
- Leveraging Social Networks for Organizing and Browsing Shared Photographs
- Media Bias and the Political Blogosphere
- Measuring Similarity between search queries
- Extracting Social Networks and Information about the people within them from Text
- LSI dependency trees
22Available corpora
- Enron email
- CIA world factbook
- DBLP: papers in CS
- NNDB: information about people
- BLOGS: collection of blogs
- US congressional speeches
- AOL queries
- Netflix recommendations
- IMDB
- NIE: news articles
- PUBMED: biomedical paper abstracts
- Wikipedia
- ACL Anthology: collection of papers in NLP/CL
- DOTGOV: download of .GOV
- biocreative: biomedical papers
- WT100G: 100GB download of the web
- Google n-grams
- webfreq: frequency of words on the web
- SMS corpus
- Citeseer: CS papers
- DMOZ: the open directory project
- corpus of paraphrases
- multilingual parallel parliamentary proceedings
- textual entailment corpus
- question answering corpus
- summarization corpus
- various text classification corpora (Reuters-21578, 20NG)
- Peekaboom (from the game)
23Related courses elsewhere
- Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)
- Cornell (Jon Kleinberg)
- CMU (Yiming Yang and Jamie Callan)
- UMass (James Allan)
- UTexas (Ray Mooney)
- Illinois (Chengxiang Zhai)
- Johns Hopkins (David Yarowsky)
- For a long list of courses related to Search Engines, Natural Language Processing, Machine Learning look here: http://clair.si.umich.edu:8080/wordpress/?p=11
24SET Fall 2007
2. Models of Information retrieval
The Vector model
The Boolean model
25Sample queries (from Excite)
- In what year did baseball become an offical sport?
- play station codes . com
- birth control and depression
- government
- "WorkAbility I"conference
- kitchen appliances
- where can I find a chines rosewood
- tiger electronics
- 58 Plymouth Fury
- How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
- emeril Lagasse
- Hubble
- M.S Subalaksmi
- running
26Key Terms Used in IR
- QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
- DOCUMENT: an information entity that the user wants to retrieve
- COLLECTION: a set of documents
- INDEX: a representation of information that makes querying easier
- TERM: word or concept that appears in a document or a query
27Mappings and abstractions
[Figure: mappings between Reality and Data, and between Information need and Query. From Robert Korfhage's book.]
28Documents
- Not just printed paper
- Can be records, pages, sites, images, people, movies
- Document encoding (Unicode)
- Document representation
- Document preprocessing
29Sample query sessions (from AOL)
- toley spies grames; tolley spies games; totally spies games
- tajmahal restaurant brooklyn ny; taj mahal restaurant brooklyn ny; taj mahal restaurant brooklyn ny 11209
- do you love me like you say; do you love me like you say lyrics; do you love me like you say lyrics marvin gaye
30Characteristics of user queries
- Sessions: users revisit their queries.
- Very short queries: typically 2 words long.
- A large number of typos.
- A small number of popular queries. A long tail of infrequent ones.
- Almost no use of advanced query operators, with the exception of double quotes.
31Queries as documents
- Advantages
- Mathematically easier to manage
- Problems
- Different lengths
- Syntactic differences
- Repetitions of words (or lack thereof)
32Document representations
- Term-document matrix (m x n)
- Document-document matrix (n x n)
- Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)
- Typical example on the Web: n = 30,000,000,000, m = 1,000,000
- Boolean vs. integer-valued matrices
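As a rough illustration of the term-document matrix described above, the following sketch builds an integer-valued m x n matrix and its Boolean counterpart. The function names, the whitespace tokenization, and the three toy documents are assumptions made for the example, not part of the course material.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the count of term i in document j."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

def to_boolean(matrix):
    """Collapse an integer-valued matrix to a Boolean (0/1) incidence matrix."""
    return [[1 if v > 0 else 0 for v in row] for row in matrix]

# Hypothetical toy collection: n = 3 documents.
docs = [
    "computer information retrieval",
    "computer retrieval retrieval",
    "information",
]
terms, tdm = term_document_matrix(docs)
for term, row in zip(terms, tdm):
    print(term, row)
```

A dense matrix for the medium-sized example would have 3,000,000 x 50,000 = 1.5 x 10^11 cells, almost all of them zero, which is one reason practical systems usually store only the nonzero entries.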
33Major IR models
- Boolean
- Vector
- Probabilistic
- Language modeling
- Fuzzy retrieval
- Latent semantic indexing
34The Boolean model
Venn diagrams
[Figure: Venn diagram of document sets D1 and D2, with regions labeled w, x, y, and z.]
35Boolean queries
- Operators: AND, OR, NOT, parentheses
- Example
- CLEVELAND AND NOT OHIO
- (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
- Ambiguous uses of AND and OR in human language
- Exclusive vs. inclusive OR
- Restrictive operator: AND or OR?
36Canonical forms of queries
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
- Normal forms
- Conjunctive normal form (CNF)
- Disjunctive normal form (DNF)
- Reference librarians prefer CNF - why?
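The two equivalences at the top of this slide (De Morgan's laws) are what allow a NOT to be pushed inward when rewriting a query into a normal form. A small sketch that checks both by enumerating the truth table; the function name is made up for this example.

```python
from itertools import product

def de_morgan_holds():
    """Check both De Morgan equivalences for every truth assignment of A and B."""
    for a, b in product([False, True], repeat=2):
        # NOT (A AND B) should equal (NOT A) OR (NOT B)
        if (not (a and b)) != ((not a) or (not b)):
            return False
        # NOT (A OR B) should equal (NOT A) AND (NOT B)
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())
```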
37Evaluating Boolean queries
- Incidence vectors
- CLEVELAND: 1100010
- OHIO: 1000111
- Examples
- CLEVELAND AND OHIO
- CLEVELAND AND NOT OHIO
- CLEVELAND OR OHIO
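One way to evaluate the three example queries is to treat each incidence vector as a bit string and apply bitwise operators. This is a minimal sketch; the helper names `parse` and `fmt` and the choice of Python integers as bit vectors are assumptions for illustration.

```python
WIDTH = 7                    # number of documents in the incidence vectors
MASK = (1 << WIDTH) - 1      # keeps NOT results within WIDTH bits

def parse(bits):
    """Convert a bit string such as '1100010' to an integer bit vector."""
    return int(bits, 2)

def fmt(value):
    """Format an integer bit vector back to a zero-padded bit string."""
    return format(value, "0{}b".format(WIDTH))

cleveland = parse("1100010")
ohio = parse("1000111")

and_result = fmt(cleveland & ohio)               # CLEVELAND AND OHIO
and_not_result = fmt(cleveland & ~ohio & MASK)   # CLEVELAND AND NOT OHIO
or_result = fmt(cleveland | ohio)                # CLEVELAND OR OHIO

print(and_result, and_not_result, or_result)
# prints: 1000010 0100000 1100111
```

The mask matters for NOT: Python integers are unbounded, so `~ohio` alone would be a negative number rather than a 7-bit complement.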
38Exercise
- D1: computer information retrieval
- D2: computer retrieval
- D3: information
- D4: computer information
- Q1: information AND retrieval
- Q2: information AND NOT computer
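To check answers to this exercise, the four documents can be loaded into a small inverted index that maps each term to the set of documents containing it. The sketch below assumes whitespace tokenization; `build_index` is a hypothetical helper name, and AND NOT is implemented as intersection with the complement of a posting set.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)
all_docs = set(docs)

# Q1: information AND retrieval
q1 = index["information"] & index["retrieval"]
# Q2: information AND NOT computer
q2 = index["information"] & (all_docs - index["computer"])

print(sorted(q1), sorted(q2))
# prints: ['D1'] ['D3']
```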
39Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
40How to deal with?
- Multi-word phrases?
- Document ranking?
41The Vector model
[Figure: Doc 1, Doc 2, and Doc 3 plotted in a space whose axes are Term 1, Term 2, and Term 3.]
42Vector queries
- Each document is represented as a vector
- Non-efficient representation
- Dimensional compatibility
43The matching process
- Document space
- Matching is done between a document and a query (or between two documents)
- Distance vs. similarity measures.
- Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
44Miscellaneous similarity measures
- The Cosine measure (normalized dot product)
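$$\sigma(D,Q) = \frac{\sum_i (d_i \times q_i)}{\sqrt{\sum_i (d_i)^2} \, \sqrt{\sum_i (q_i)^2}} = \frac{|X \cap Y|}{\sqrt{|X|} \, \sqrt{|Y|}}$$
- The Jaccard coefficient
$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$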
45Exercise
- Compute the cosine scores σ(D1,D2) and σ(D1,D3) for the documents D1, D2 and D3
- Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
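The measures named in this exercise can be sketched as short functions over term-weight vectors. The vectors `a`, `b`, and `c` below are hypothetical and are not the D1, D2, and D3 of the exercise; the Jaccard coefficient is computed here over the sets of terms with nonzero weight, which is one common reading and an assumption of this sketch.

```python
import math

def cosine(x, y):
    """Normalized dot product; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def euclidean(x, y):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard(x, y):
    """|X intersect Y| / |X union Y| over the indices of nonzero terms.

    Assumes at least one of the vectors has a nonzero entry.
    """
    sx = {i for i, a in enumerate(x) if a}
    sy = {i for i, b in enumerate(y) if b}
    return len(sx & sy) / len(sx | sy)

# Hypothetical vectors: b points in the same direction as a, c is orthogonal to a.
a = [1, 2, 0]
b = [2, 4, 0]
c = [0, 0, 3]

for name, other in (("b", b), ("c", c)):
    print(name, cosine(a, other), euclidean(a, other),
          manhattan(a, other), jaccard(a, other))
```

Note that `a` and `b` have a cosine of about 1 even though their Euclidean and Manhattan distances are nonzero: the cosine measure ignores vector length, while the distance measures do not.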
46Readings
- For September 13: MRS1, MRS2, MRS5 (Zipf)
- For September 20: MRS7, MRS8