Title: Search Engine Technology
1Search Engine Technology(1)
Prof. Dragomir R. Radev radev_at_cs.columbia.edu
2SET FALL 2009
7Examples of search engines
- Conventional (library catalog). Search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, Yahoo!). Search by keywords. Limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
- Question answering systems (Ask, NSIR, Answerbus). Search in (restricted) natural language.
- Clustering systems (Vivísimo, Clusty)
- Research systems (Lemur, Nutch)
8What does it take to build a search engine?
- Decide what to index
- Collect it
- Index it (efficiently)
- Keep the index up to date
- Provide user-friendly query facilities
9What else?
- Understand the structure of the web for efficient crawling
- Understand user information needs
- Preprocess text and other unstructured data
- Cluster data
- Classify data
- Evaluate performance
10Goals of the course
- Understand how search engines work
- Understand the limits of existing search technology
- Learn to appreciate the sheer size of the Web
- Learn to write code for text indexing and retrieval
- Learn about the state of the art in IR research
- Learn to analyze textual and semi-structured data sets
- Learn to appreciate the diversity of texts on the Web
- Learn to evaluate information retrieval
- Learn about standardized document collections
- Learn about text similarity measures
- Learn about semantic dimensionality reduction
- Learn about the idiosyncrasies of hyperlinked document collections
- Learn about web crawling
- Learn to use existing software
- Understand the dynamics of the Web by building appropriate mathematical models
- Build working systems that assist users in finding useful information on the Web
11Course logistics
- Thursdays 6:10-8:00
- Office hours: TBA
- URL: http://www.cs.columbia.edu/cs6998
- Instructor: Dragomir Radev
- Email: radev_at_cs.columbia.edu
- TAs: Yves Petinot (ypetinot_at_cs.columbia.edu), Kaushal Lahankar (knl2102_at_columbia.edu)
12Course outline
- Classic document retrieval: storing, indexing, retrieval.
- Web retrieval: crawling, query processing.
- Text and web mining: classification, clustering.
- Network analysis: random graph models, centrality, diameter and clustering coefficient.
13Syllabus
- Introduction.
- Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model.
- Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes.
- Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TFIDF. Vector space similarity and ranking.
- Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
- Automated indexing/labeling. Compression and coding. Optimal codes.
- String matching. Approximate matching.
- Query expansion. Relevance feedback.
- Text classification. Naive Bayes. Feature selection. Decision trees.
14Syllabus
- Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.
- Lexical semantics and WordNet.
- Latent semantic indexing. Singular value decomposition.
- Vector space clustering. k-means clustering. EM clustering.
- Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
- Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.
- Graph-based methods. Harmonic functions. Random walks.
- PageRank. Hubs and authorities. Bipartite graphs. HITS.
- Models of the Web.
15Syllabus
- Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie-method.
- Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
- Question answering
- Burstiness. Self-triggerability
- Information extraction
- Adversarial IR. Human behavior on the web.
- Text summarization
- POSSIBLE TOPICS
- Discovering communities, spectral clustering
- Semi-supervised retrieval
- Natural language processing. XML retrieval. Text tiling. Human behavior on the web.
16Readings
- required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html), freely available, hard copy for sale
- optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
- papers from SIGIR, WWW and journals (to be announced in class).
17Prerequisites
- Linear algebra: vectors and matrices.
- Calculus: finding extrema of functions.
- Probabilities: random variables, discrete and continuous distributions, Bayes theorem.
- Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.
- Required: CS account
18Course requirements
- Three assignments (30%)
  - Some of them will be in Perl. The rest can be done in any appropriate language. All will involve some data analysis and evaluation.
- Final project (30%)
  - Research paper or software system.
- Class participation (10%)
- Final exam (30%)
19Final project format
- Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.
- Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing a solution and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand-alone or an extension to an existing one.
20Project ideas
- Build a question answering system.
- Build a language identification system.
- Social network analysis from the Web.
- Participate in the Netflix challenge.
- Query log analysis.
- Build models of Web evolution.
- Information diffusion in blogs or web.
- Author-topic models of web pages.
- Using the web for machine translation.
- Building evolving models of web documents.
- News recommendation system.
- Compress the text of Wikipedia (losslessly).
- Spelling correction using query logs.
- Automatic query expansion.
21List of projects from the past
- Document Closures for Indexing
- Tibet - Table Structure Recognition Library
- Ruby Blog Memetracker
- Sentence decomposition for more accurate information retrieval
- Extracting Social Networks from LiveJournal
- Google Suggest Programming Project (Java Swing Client and Lucene Back-End)
- Leveraging Social Networks for Organizing and Browsing Shared Photographs
- Media Bias and the Political Blogosphere
- Measuring Similarity between search queries
- Extracting Social Networks and Information about the people within them from Text
- LSI dependency trees
22Available corpora
- Netflix challenge
- AOL query logs
- Blogs
- Bio papers
- AAN
- Email
- Generifs
- Web pages
- Political science corpus
- VAST
- del.icio.us
- SMS
- News data: aquaint, tdt, nantc, reuters, setimes, trec, tipster
- Europarl multilingual
- US congressional data
- DMOZ
- Pubmedcentral
- DUC/TAC
- Timebank
- Wikipedia
- wt2g/wt10g/wt100g
- dotgov
- RTE
- Paraphrases
- GENIA
- Hansards
- IMDB
- MTA/MTC
- nie
- cnnsumm
- Poliblog
- Sentiment
- xml
- epinions
- Enron
23Related courses elsewhere
- Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)
- Cornell (Jon Kleinberg)
- CMU (Yiming Yang and Jamie Callan)
- UMass (James Allan)
- UTexas (Ray Mooney)
- Illinois (Chengxiang Zhai)
- Johns Hopkins (David Yarowsky)
- For a long list of courses related to Search Engines, Natural Language Processing, Machine Learning look here: http://tangra.si.umich.edu/clair/clair/courses.html
24SET FALL 2009
2. Models of Information retrieval. The Vector model. The Boolean model.
25The web is really large
- 100 B pages
- Dynamically generated content
- New pages get added all the time
- Technorati has 50M blogs
- The size of the blogosphere doubles every 6 months
- Yahoo deals with 12TB of data per day (according to Ron Brachman)
26Sample queries (from Excite)
- In what year did baseball become an offical sport?
- play station codes . com
- birth control and depression
- government "WorkAbility I"conference
- kitchen appliances
- where can I find a chines rosewood
- tiger electronics
- 58 Plymouth Fury
- How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
- emeril Lagasse
- Hubble
- M.S Subalaksmi
- running
27Fun things to do with search engines
- Googlewhack
- Reduce document set size to 1
- Find a query that will bring a given URL into the top 10
28Key Terms Used in IR
- QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
- DOCUMENT: an information entity that the user wants to retrieve
- COLLECTION: a set of documents
- INDEX: a representation of information that makes querying easier
- TERM: word or concept that appears in a document or a query
29Mappings and abstractions
[Figure: Reality, Data, Information need, Query. From Robert Korfhage's book.]
30Documents
- Not just printed paper
- Can be records, pages, sites, images, people, movies
- Document encoding (Unicode)
- Document representation
- Document preprocessing
31Sample query sessions (from AOL)
- toley spies grames → tolley spies games → totally spies games
- tajmahal restaurant brooklyn ny → taj mahal restaurant brooklyn ny → taj mahal restaurant brooklyn ny 11209
- do you love me like you say → do you love me like you say lyrics → do you love me like you say lyrics marvin gaye
M /data4/corpora/AOL-user-ct-collection
32Characteristics of user queries
- Sessions: users revisit their queries.
- Very short queries: typically 2 words long.
- A large number of typos.
- A small number of popular queries. A long tail of infrequent ones.
- Almost no use of advanced query operators with the exception of double quotes
33Queries as documents
- Advantages
- Mathematically easier to manage
- Problems
- Different lengths
- Syntactic differences
- Repetitions of words (or lack thereof)
34Document representations
- Term-document matrix (m x n)
- Document-document matrix (n x n)
- Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)
- Typical example on the Web: n = 30,000,000,000, m = 1,000,000
- Boolean vs. integer-valued matrices
35Storage issues
- Imagine a medium-sized collection with n = 3,000,000 and m = 50,000
- How large a term-document matrix will be needed?
- Is there any way to do better? Any heuristic?
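A rough back-of-the-envelope sketch for the question above. It assumes a dense Boolean matrix stored at one bit per cell and, for the sparse alternative, a 4-byte document id per posting. The figure of 200 distinct terms per document is purely hypothetical and is not from the slides.

```python
# Dense vs. sparse storage for the medium-sized collection on this slide.
n_docs = 3_000_000    # n, from the slide
n_terms = 50_000      # m, from the slide

cells = n_docs * n_terms               # one cell per (term, document) pair
dense_bytes = cells // 8               # assumption: 1 bit per Boolean cell

# Hypothetical figure, not from the slides: distinct terms per document.
avg_distinct_terms = 200
nonzero = n_docs * avg_distinct_terms  # cells that are actually 1
sparse_bytes = nonzero * 4             # assumption: 4-byte document id per posting

print(f"cells: {cells:,}")
print(f"dense Boolean matrix: {dense_bytes / 1e9:.2f} GB")
print(f"postings only: {sparse_bytes / 1e9:.2f} GB")
```

Under these assumptions almost all cells are zero, which is why storing only the nonzero entries is the natural heuristic.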
36Inverted index
- Instead of an incidence vector, use a posting table
- CLEVELAND: D1, D2, D6
- OHIO: D1, D5, D6, D7
- Use linked lists to be able to insert new document postings in order and to remove existing postings.
- Keep everything sorted! This gives you a logarithmic improvement in access.
37Basic operations on inverted indexes
- Conjunction (AND): iterative merge of the two postings, O(x+y), where x and y are the lengths of the two posting lists
- Disjunction (OR): very similar
- Negation (NOT): can we still do it in O(x+y)?
- Example: MICHIGAN AND NOT OHIO
- Example: MICHIGAN OR NOT OHIO
- Recursive operations
- Optimization: start with the smallest sets
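The merge operations above can be sketched as follows, assuming posting lists are sorted Python lists of document ids. The function names are illustrative. Each function advances two pointers and touches every posting at most once.

```python
def intersect(a, b):
    """a AND b: walk both sorted posting lists once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """a OR b: same walk, but keep every id seen in either list."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def and_not(a, b):
    """a AND NOT b: keep ids of a that do not occur in b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out

cleveland = [1, 2, 6]   # postings from the inverted index slide
ohio = [1, 5, 6, 7]
print(intersect(cleveland, ohio))  # [1, 6]
print(union(cleveland, ohio))      # [1, 2, 5, 6, 7]
print(and_not(cleveland, ohio))    # [2]
```

A query of the form A OR NOT B is different in kind: its result depends on every document in the collection, not only on the two posting lists, so the same bound does not apply.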
38Major IR models
- Boolean
- Vector
- Probabilistic
- Language modeling
- Fuzzy retrieval
- Latent semantic indexing
39The Boolean model
[Figure: Venn diagrams with regions labeled w, x, y, z and documents D1, D2]
40Boolean queries
- Operators: AND, OR, NOT, parentheses
- Examples:
- CLEVELAND AND NOT OHIO
- (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
- Ambiguous uses of AND and OR in human language
- Exclusive vs. inclusive OR
- Restrictive operator: AND or OR?
41Canonical forms of queries
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
- Normal forms
- Conjunctive normal form (CNF)
- Disjunctive normal form (DNF)
- Reference librarians prefer CNF - why?
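The two rewriting rules at the top of this slide are De Morgan's laws. A small sketch that checks them exhaustively over all truth assignments (the function name `de_morgan_holds` is illustrative):

```python
from itertools import product

def de_morgan_holds():
    """Check both identities for every combination of truth values of A and B."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())  # True
```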
42Evaluating Boolean queries
- Incidence vectors
- CLEVELAND: 1100010
- OHIO: 1000111
- Examples
- CLEVELAND AND OHIO
- CLEVELAND AND NOT OHIO
- CLEVELAND OR OHIO
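One way to evaluate the three example queries is with bitwise operations on the incidence vectors. This is a sketch that assumes each vector is stored as an integer with one bit per document, using the seven-document vectors from this slide. The helper `show` is illustrative.

```python
N_DOCS = 7
MASK = (1 << N_DOCS) - 1          # keeps results within the 7 document bits

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

def show(bits):
    """Render an incidence vector as a fixed-width bit string."""
    return format(bits & MASK, f"0{N_DOCS}b")

print(show(cleveland & ohio))     # CLEVELAND AND OHIO      -> 1000010
print(show(cleveland & ~ohio))    # CLEVELAND AND NOT OHIO  -> 0100000
print(show(cleveland | ohio))     # CLEVELAND OR OHIO       -> 1100111
```

The mask matters for NOT: in Python, `~ohio` is a negative integer, so the result has to be clipped to the number of documents.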
43Exercise
- D1: computer information retrieval
- D2: computer retrieval
- D3: information
- D4: computer information
- Q1: information AND retrieval
- Q2: information AND NOT computer
44Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
45How to deal with?
- Multi-word phrases?
- Document ranking?
46The Vector model
[Figure: Doc 1, Doc 2 and Doc 3 drawn as vectors in a space with axes Term 1, Term 2 and Term 3]
47Vector queries
- Each document is represented as a vector
- Non-efficient representation
- Dimensional compatibility
48The matching process
- Document space
- Matching is done between a document and a query (or between two documents)
- Distance vs. similarity measures.
- Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
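Two of the measures listed above, word overlap and the Jaccard coefficient, can be sketched on sets of words. The sketch assumes whitespace tokenization and treats each text as a set of distinct words. The function names are illustrative and the two sample texts are reused from the earlier Boolean exercise.

```python
def word_overlap(x, y):
    """Number of distinct words the two texts share."""
    return len(set(x.split()) & set(y.split()))

def jaccard(x, y):
    """Shared words divided by all distinct words in either text."""
    a, b = set(x.split()), set(y.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

d1 = "computer information retrieval"
d2 = "computer retrieval"
print(word_overlap(d1, d2))       # 2
print(round(jaccard(d1, d2), 3))  # 0.667
```

Word overlap grows with document length, while the Jaccard coefficient is normalized to the range 0 to 1.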
49Miscellaneous similarity measures
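- The Cosine measure (normalized dot product)
$$\sigma(D,Q) = \frac{\sum_i d_i \, q_i}{\sqrt{\sum_i d_i^2 \cdot \sum_i q_i^2}} = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}$$
- The Jaccard coefficient
$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$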
50Exercise
- Compute the cosine scores σ(D1,D2) and σ(D1,D3) for the documents D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
- Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
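One way to check answers to this exercise is a short script. It is a sketch that treats each document as a plain list of term weights. The function names are illustrative. The Jaccard coefficient is left out because it needs a set interpretation of the vectors, which the exercise leaves open.

```python
import math

def cosine(d, q):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    return dot / math.sqrt(sum(x * x for x in d) * sum(y * y for y in q))

def euclidean(d, q):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, q)))

def manhattan(d, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(d, q))

D1, D2, D3 = [1, 3], [100, 300], [3, 1]

for name, other in (("D2", D2), ("D3", D3)):
    print(name,
          round(cosine(D1, other), 3),
          round(euclidean(D1, other), 3),
          manhattan(D1, other))
```

D2 points in the same direction as D1, so its cosine score is 1 even though it is far from D1 under both distance measures. D3 is close to D1 in distance but has a cosine score of only 0.6.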
51Readings
- (1) MRS1, MRS2, MRS5 (Zipf)
- (2) MRS7, MRS8