Search Engine Technology


Provided by: rad2

Learn more at: http://www1.cs.columbia.edu


Transcript and Presenter's Notes


1
Search Engine Technology (1)
Prof. Dragomir R. Radev radev_at_cs.columbia.edu
2
SET FALL 2009

Introduction

7
Examples of search engines

Conventional (library catalog). Search by
keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, Yahoo!). Search
by keywords. Limited search using queries in
natural language.
Multimedia (QBIC, WebSeek, SaFe). Search by visual
appearance (shapes, colors, ...).
Question answering systems (Ask, NSIR,
Answerbus). Search in (restricted) natural
language.
Clustering systems (Vivísimo, Clusty)
Research systems (Lemur, Nutch)

8
What does it take to build a search engine?

Decide what to index
Collect it
Index it (efficiently)
Keep the index up to date
Provide user-friendly query facilities

9
What else?

Understand the structure of the web for efficient
crawling
Understand user information needs
Preprocess text and other unstructured data
Cluster data
Classify data
Evaluate performance

10
Goals of the course

Understand how search engines work
Understand the limits of existing search
technology
Learn to appreciate the sheer size of the Web
Learn to write code for text indexing and
retrieval
Learn about the state of the art in IR research
Learn to analyze textual and semi-structured data
sets
Learn to appreciate the diversity of texts on the
Web
Learn to evaluate information retrieval
Learn about standardized document collections
Learn about text similarity measures
Learn about semantic dimensionality reduction
Learn about the idiosyncrasies of hyperlinked
document collections
Learn about web crawling
Learn to use existing software
Understand the dynamics of the Web by building
appropriate mathematical models
Build working systems that assist users in
finding useful information on the Web

11
Course logistics

Thursdays 6:10-8:00
Office hours: TBA
URL: http://www.cs.columbia.edu/cs6998
Instructor: Dragomir Radev
Email: radev_at_cs.columbia.edu
TAs:
Yves Petinot (ypetinot_at_cs.columbia.edu)
Kaushal Lahankar (knl2102_at_columbia.edu)

12
Course outline

Classic document retrieval: storing, indexing,
retrieval.
Web retrieval: crawling, query processing.
Text and web mining: classification, clustering.
Network analysis: random graph models,
centrality, diameter and clustering coefficient.

13
Syllabus

Introduction.
Queries and Documents. Models of Information
retrieval. The Boolean model. The Vector model.
Document preprocessing. Tokenization. Stemming.
The Porter algorithm. Storing, indexing and
searching text. Inverted indexes.
Word distributions. The Zipf distribution. The
Benford distribution. Heaps' law. TF-IDF. Vector
space similarity and ranking.
Retrieval evaluation. Precision and Recall.
F-measure. Reference collections. The TREC
conferences.
Automated indexing/labeling. Compression and
coding. Optimal codes.
String matching. Approximate matching.
Query expansion. Relevance feedback.
Text classification. Naive Bayes. Feature
selection. Decision trees.

14
Syllabus

Linear classifiers. k-nearest neighbors.
Perceptron. Kernel methods. Maximum-margin
classifiers. Support vector machines.
Semi-supervised learning.
Lexical semantics and Wordnet.
Latent semantic indexing. Singular value
decomposition.
Vector space clustering. k-means clustering. EM
clustering.
Random graph models. Properties of random graphs:
clustering coefficient, betweenness, diameter,
giant connected component, degree distribution.
Social network analysis. Small worlds and
scale-free networks. Power law distributions.
Centrality.
Graph-based methods. Harmonic functions. Random
walks.
PageRank. Hubs and authorities. Bipartite graphs.
HITS.
Models of the Web.

15
Syllabus

Crawling the web. Webometrics. Measuring the size
of the web. The Bow-tie-method.
Hypertext retrieval. Web-based IR. Document
closures. Focused crawling.
Question answering
Burstiness. Self-triggerability
Information extraction
Adversarial IR. Human behavior on the web.
Text summarization
POSSIBLE TOPICS
Discovering communities, spectral clustering
Semi-supervised retrieval
Natural language processing. XML retrieval. Text
tiling. Human behavior on the web.

16
Readings

Required: Information Retrieval by Manning,
Schuetze, and Raghavan
(http://www-csli.stanford.edu/schuetze/information-retrieval-book.html),
freely available, hard copy for sale.
Optional: Modeling the Internet and the Web:
Probabilistic Methods and Algorithms by Pierre
Baldi, Paolo Frasconi, Padhraic Smyth, Wiley,
2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
Papers from SIGIR, WWW and journals (to be
announced in class).

17
Prerequisites

Linear algebra: vectors and matrices.
Calculus: finding extrema of functions.
Probabilities: random variables, discrete and
continuous distributions, Bayes' theorem.
Programming: experience with at least one
web-aware programming language such as Perl
(highly recommended) or Java in a UNIX
environment.
Required: CS account

18
Course requirements

Three assignments (30%)
Some of them will be in Perl. The rest can be
done in any appropriate language. All will
involve some data analysis and evaluation.
Final project (30%)
Research paper or software system.
Class participation (10%)
Final exam (30%)

19
Final project format

Research paper - using the SIGIR format. Students
will be in charge of problem formulation,
literature survey, hypothesis formulation,
experimental design, implementation, and possibly
submission to a conference like SIGIR or WWW.
Software system - develop a working system or
API. Students will be responsible for identifying
a niche problem, implementing a solution and
deploying it, either on the Web or as an
open-source downloadable tool. The system can be
either standalone or an extension to an existing one.

20
Project ideas

Build a question answering system.
Build a language identification system.
Social network analysis from the Web.
Participate in the Netflix challenge.
Query log analysis.
Build models of Web evolution.
Information diffusion in blogs or web.
Author-topic models of web pages.
Using the web for machine translation.
Building evolving models of web documents.
News recommendation system.
Compress the text of Wikipedia (losslessly).
Spelling correction using query logs.
Automatic query expansion.

21
List of projects from the past

Document Closures for Indexing
Tibet - Table Structure Recognition Library
Ruby Blog Memetracker
Sentence decomposition for more accurate
information retrieval
Extracting Social Networks from LiveJournal
Google Suggest Programming Project (Java Swing
Client and Lucene Back-End)
Leveraging Social Networks for Organizing and
Browsing Shared Photographs
Media Bias and the Political Blogosphere
Measuring Similarity between search queries
Extracting Social Networks and Information about
the people within them from Text
LSI dependency trees

22
Available corpora

Netflix challenge
AOL query logs
Blogs
Bio papers
AAN
Email
Generifs
Web pages
Political science corpus
VAST
del.icio.us
SMS
News data: aquaint, tdt, nantc, reuters, setimes,
trec, tipster
Europarl multilingual
US congressional data
DMOZ
Pubmedcentral
DUC/TAC

Timebank
Wikipedia
wt2g/wt10g/wt100g
dotgov
RTE
Paraphrases
GENIA
Generifs
Hansards
IMDB
MTA/MTC
nie
cnnsumm
Poliblog
Sentiment
xml
epinions
Enron

23
Related courses elsewhere

Stanford (Chris Manning, Prabhakar Raghavan, and
Hinrich Schuetze)
Cornell (Jon Kleinberg)
CMU (Yiming Yang and Jamie Callan)
UMass (James Allan)
UTexas (Ray Mooney)
Illinois (Chengxiang Zhai)
Johns Hopkins (David Yarowsky)
For a long list of courses related to Search
Engines, Natural Language Processing, and Machine
Learning, see
http://tangra.si.umich.edu/clair/clair/courses.html

24
SET FALL 2009
2. Models of Information retrieval. The
Vector model. The Boolean model.
25
The web is really large

100 B pages
Dynamically generated content
New pages get added all the time
Technorati has 50M blogs
The size of the blogosphere doubles every 6
months
Yahoo deals with 12TB of data per day (according
to Ron Brachman)

26
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
27
Fun things to do with search engines

Googlewhack
Reduce document set size to 1
Find query that will bring given URL in the top
10

28
Key Terms Used in IR

QUERY a representation of what the user is
looking for - can be a list of words or a phrase.
DOCUMENT an information entity that the user
wants to retrieve
COLLECTION a set of documents
INDEX a representation of information that makes
querying easier
TERM word or concept that appears in a document
or a query

29
Mappings and abstractions
Reality
Data
Information need
Query
From Robert Korfhage's book
30
Documents

Not just printed paper
Can be records, pages, sites, images, people,
movies
Document encoding (Unicode)
Document representation
Document preprocessing

31
Sample query sessions (from AOL)

toley spies grames -> tolley spies games -> totally spies games
tajmahal restaurant brooklyn ny -> taj mahal restaurant brooklyn ny -> taj mahal restaurant brooklyn ny 11209
do you love me like you say -> do you love me like you say lyrics -> do you love me like you say lyrics marvin gaye

M /data4/corpora/AOL-user-ct-collection
32
Characteristics of user queries

Sessions: users revisit their queries.
Very short queries: typically 2 words long.
A large number of typos.
A small number of popular queries. A long tail of
infrequent ones.
Almost no use of advanced query operators, with
the exception of double quotes.

33
Queries as documents

Advantages
Mathematically easier to manage
Problems
Different lengths
Syntactic differences
Repetitions of words (or lack thereof)

34
Document representations

Term-document matrix (m x n)
Document-document matrix (n x n)
Typical example in a medium-sized collection:
3,000,000 documents (n) with 50,000 terms (m)
Typical example on the Web: n = 30,000,000,000,
m = 1,000,000
Boolean vs. integer-valued matrices
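
To make the m x n layout concrete, here is a minimal Python sketch that builds a small term-document matrix, first integer-valued (term counts) and then Boolean. The three toy documents and the function name `term_document_matrix` are assumptions made for illustration, and tokenization is just lowercasing and whitespace splitting.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] is the count of term i in document j."""
    # One bag of words per document (naive whitespace tokenization).
    counts = [Counter(text.lower().split()) for text in docs]
    # Rows are the m distinct terms in sorted order; columns are the n documents.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Toy collection (illustrative only): n = 3 documents, m = 3 terms.
docs = ["cleveland ohio cleveland", "ohio michigan", "michigan"]
terms, counts = term_document_matrix(docs)

# The Boolean matrix only records presence or absence of a term.
boolean = [[1 if v else 0 for v in row] for row in counts]

for term, row, brow in zip(terms, counts, boolean):
    print(term, row, brow)
```

Most cells are zero even in this tiny example, which is the observation the storage discussion builds on.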

35
Storage issues

Imagine a medium-sized collection with
n = 3,000,000 and m = 50,000
How large a term-document matrix will be needed?
Is there any way to do better? Any heuristic?
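
A back-of-the-envelope calculation for the question above, as a short Python sketch. The collection sizes come from the slide; the 4-byte cell and document-id sizes and the figure of 200 distinct terms per document are assumptions chosen only to illustrate the order of magnitude.

```python
n_docs = 3_000_000   # n, from the slide
n_terms = 50_000     # m, from the slide

cells = n_docs * n_terms                 # one cell per (term, document) pair

# Dense Boolean matrix: one bit per cell.
dense_boolean_bytes = cells // 8

# Dense integer matrix: ASSUMPTION of 4 bytes per count.
dense_integer_bytes = cells * 4

# Sparse alternative: store only the non-zero cells as postings.
# ASSUMPTION: about 200 distinct terms per document, 4 bytes per document id.
avg_distinct_terms_per_doc = 200
postings = n_docs * avg_distinct_terms_per_doc
sparse_bytes = postings * 4

print(f"cells:          {cells:,}")
print(f"dense Boolean:  {dense_boolean_bytes / 1e9:.2f} GB")
print(f"dense integer:  {dense_integer_bytes / 1e9:.2f} GB")
print(f"sparse postings:{sparse_bytes / 1e9:.2f} GB")
```

Under these assumptions the dense matrix has 150 billion cells, while storing only the non-zero entries is smaller by roughly an order of magnitude or more. That gap is the heuristic behind the inverted index.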

36
Inverted index

Instead of an incidence vector, use a posting
table
CLEVELAND D1, D2, D6
OHIO D1, D5, D6, D7
Use linked lists to be able to insert new
document postings in order and to remove existing
postings.
Keep everything sorted! This gives you a
logarithmic improvement in access.
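
A minimal sketch of a posting table in Python. It uses sorted Python lists as a stand-in for the linked lists mentioned above, and binary search (`bisect`) to show where the logarithmic lookup comes from. The toy documents and the function names are assumptions, chosen so that the resulting postings match the CLEVELAND and OHIO rows on the slide.

```python
from bisect import bisect_left

def build_index(docs):
    """docs: dict of integer doc id -> text. Returns term -> sorted list of doc ids."""
    index = {}
    for doc_id in sorted(docs):                      # ascending ids keep lists sorted
        for term in set(docs[doc_id].upper().split()):
            index.setdefault(term, []).append(doc_id)
    return index

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's posting list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    i = bisect_left(postings, doc_id)
    if i == len(postings) or postings[i] != doc_id:
        postings.insert(i, doc_id)

def contains(index, term, doc_id):
    """Binary search over the sorted posting list: O(log k) for a list of length k."""
    postings = index.get(term, [])
    i = bisect_left(postings, doc_id)
    return i < len(postings) and postings[i] == doc_id

# Toy collection (illustrative only).
docs = {1: "cleveland ohio", 2: "cleveland", 5: "ohio",
        6: "ohio cleveland", 7: "ohio"}
index = build_index(docs)
print(index["CLEVELAND"])   # postings for CLEVELAND
print(index["OHIO"])        # postings for OHIO

add_posting(index, "CLEVELAND", 4)
print(index["CLEVELAND"], contains(index, "CLEVELAND", 4))
```

Inserting into the middle of a Python list costs linear time, which is one reason a real implementation might prefer linked lists or a different layout; the sketch only illustrates the sorted-postings idea.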

37
Basic operations on inverted indexes

Conjunction (AND): iterative merge of the two
postings lists, O(x+y)
Disjunction (OR): very similar
Negation (NOT): can we still do it in O(x+y)?
Example: MICHIGAN AND NOT OHIO
Example: MICHIGAN OR NOT OHIO
Recursive operations
Optimization: start with the smallest sets
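
The merge operations can be sketched in Python as follows, assuming each posting list is a sorted list of integer document ids. Each function walks both lists once, so the running time is proportional to the sum of the two list lengths. The MICHIGAN and OHIO postings below are made-up values for illustration.

```python
def intersect(a, b):
    """A AND B: advance the pointer that sits on the smaller id."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """A OR B: same walk, but every id seen is emitted once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def and_not(a, b):
    """A AND NOT B: keep ids of a that do not occur in b, still a single pass."""
    out = []
    j = 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

# Hypothetical postings.
michigan = [2, 3, 5, 8]
ohio = [1, 5, 6, 7]
print(intersect(michigan, ohio))   # MICHIGAN AND OHIO
print(union(michigan, ohio))       # MICHIGAN OR OHIO
print(and_not(michigan, ohio))     # MICHIGAN AND NOT OHIO
```

MICHIGAN AND NOT OHIO stays cheap because the result is a subset of one posting list. MICHIGAN OR NOT OHIO is different: NOT OHIO on its own is the complement over the whole collection, so the result can be as large as the collection itself.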

38
Major IR models

Boolean
Vector
Probabilistic
Language modeling
Fuzzy retrieval
Latent semantic indexing

39
The Boolean model
Venn diagrams
[Figure: Venn diagram; labels w, x, y, z, D1, D2]
40
Boolean queries

Operators: AND, OR, NOT, parentheses
Examples:
CLEVELAND AND NOT OHIO
(MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
Ambiguous uses of AND and OR in human language:
Exclusive vs. inclusive OR
Restrictive operator: AND or OR?

41
Canonical forms of queries

De Morgan's Laws

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)

Normal forms
Conjunctive normal form (CNF)
Disjunctive normal form (DNF)
Reference librarians prefer CNF - why?
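
De Morgan's laws can be checked mechanically. The Python sketch below verifies both laws over the full truth table and then over document sets, where NOT is the complement with respect to the collection. The collection of seven document ids and the two posting sets are assumptions for illustration.

```python
from itertools import product

def de_morgan_holds():
    """Check both laws for every combination of truth values of A and B."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())

# The same laws on sets of documents (hypothetical ids 1..7).
collection = set(range(1, 8))
A = {1, 2, 6}
B = {1, 5, 6, 7}

not_a_and_b = collection - (A & B)
rewritten_1 = (collection - A) | (collection - B)

not_a_or_b = collection - (A | B)
rewritten_2 = (collection - A) & (collection - B)

print(sorted(not_a_and_b), sorted(rewritten_1))
print(sorted(not_a_or_b), sorted(rewritten_2))
```

These rewrites are what allow a negation to be pushed down to individual terms when converting a query to CNF or DNF.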

42
Evaluating Boolean queries

Incidence vectors
CLEVELAND 1100010
OHIO 1000111
Examples
CLEVELAND AND OHIO
CLEVELAND AND NOT OHIO
CLEVELAND OR OHIO
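
With incidence vectors, each Boolean operator becomes a bitwise operation. A minimal Python sketch using the two vectors above, assuming the leftmost bit stands for D1 and the collection has exactly seven documents:

```python
N_DOCS = 7
MASK = (1 << N_DOCS) - 1          # keeps results within the 7 document bits

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

def bits(v):
    """Render an incidence vector as a 7-character bit string."""
    return format(v & MASK, f"0{N_DOCS}b")

both = bits(cleveland & ohio)             # CLEVELAND AND OHIO
only_cleveland = bits(cleveland & ~ohio)  # CLEVELAND AND NOT OHIO
either = bits(cleveland | ohio)           # CLEVELAND OR OHIO

print(both)            # 1000010
print(only_cleveland)  # 0100000
print(either)          # 1100111
```

Reading the results under that assumption: D1 and D6 contain both terms, D2 contains CLEVELAND but not OHIO, and every document except D3 and D4 contains at least one of the two.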

43
Exercise

D1: computer information retrieval
D2: computer retrieval
D3: information
D4: computer information
Q1: information AND retrieval
Q2: information AND NOT computer
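
One way to check an answer to this exercise is to evaluate the two queries with Python sets. This is only a sketch; the helper name `docs_with` is an assumption, and each document is treated as a bag of whitespace-separated words.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def docs_with(term):
    """Set of document names whose text contains the term."""
    return {name for name, text in docs.items() if term in text.split()}

# Q1: information AND retrieval -> set intersection
q1 = docs_with("information") & docs_with("retrieval")

# Q2: information AND NOT computer -> set difference
q2 = docs_with("information") - docs_with("computer")

print(sorted(q1))   # ['D1']
print(sorted(q2))   # ['D3']
```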

44
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
45
How to deal with?

Multi-word phrases?
Document ranking?

46
The Vector model
[Figure: Doc 1, Doc 2 and Doc 3 shown as vectors in a space with axes Term 1, Term 2 and Term 3]
47
Vector queries

Each document is represented as a vector
Inefficient representation
Dimensional compatibility

48
The matching process

Document space
Matching is done between a document and a query
(or between two documents)
Distance vs. similarity measures.
Euclidean distance, Manhattan distance, Word
overlap, Jaccard coefficient, etc.
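
The measures named above can be sketched in a few lines of Python. Euclidean and Manhattan are distances between term-weight vectors of equal length (smaller means closer); word overlap and Jaccard are similarities between sets of words (larger means closer). Returning 0.0 for the Jaccard coefficient of two empty sets is a convention chosen here, not something the slides specify.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def word_overlap(x, y):
    """Number of distinct words the two texts share."""
    return len(set(x) & set(y))

def jaccard(x, y):
    """Shared words divided by all distinct words in either text."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

# Illustrative inputs.
print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7

d = "computer information retrieval".split()
q = "computer retrieval".split()
print(word_overlap(d, q))          # 2
print(jaccard(d, q))               # 2 shared words out of 3 distinct words
```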

49
Miscellaneous similarity measures

The Cosine measure (normalized dot product)

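$$\sigma(D,Q) = \frac{\sum_i d_i \times q_i}{\sqrt{\sum_i (d_i)^2} \cdot \sqrt{\sum_i (q_i)^2}}$$

For sets of terms X and Y:

$$\sigma(D,Q) = \frac{|X \cap Y|}{\sqrt{|X|} \cdot \sqrt{|Y|}}$$

The Jaccard coefficient

$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$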
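
A minimal Python sketch of the cosine measure, in both its vector form (normalized dot product) and its set form. Returning 0.0 when either argument is all zeros or empty is a convention assumed here to avoid dividing by zero.

```python
import math

def cosine(d, q):
    """Dot product of d and q divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def cosine_sets(x, y):
    """Set form: shared terms over the geometric mean of the set sizes."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

# Illustrative inputs.
print(cosine((1, 0), (0, 1)))               # orthogonal vectors -> 0.0
print(cosine((2, 0), (5, 0)))               # same direction -> 1.0
print(cosine_sets({"a", "b"}, {"b", "c"}))  # 1 shared term, sizes 2 and 2 -> 0.5
```

Because both vectors are normalized by their lengths, scaling a document vector by a constant does not change its cosine score.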
50
Exercise

Compute the cosine scores σ(D1,D2) and σ(D1,D3)
for the documents D1 = <1,3>, D2 = <100,300> and
D3 = <3,1>
Compute the corresponding Euclidean distances,
Manhattan distances, and Jaccard coefficients.
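
A worked sketch of the vector part of this exercise, writing σ for the cosine score, d_2 for Euclidean distance and d_1 for Manhattan distance (the last two symbols are notation assumed here). The Jaccard coefficients are left out because they depend on how the vectors are read as sets, which the exercise does not specify.

```latex
\begin{align*}
\sigma(D_1,D_2) &= \frac{1\cdot 100 + 3\cdot 300}{\sqrt{1^2+3^2}\,\sqrt{100^2+300^2}}
                 = \frac{1000}{\sqrt{10}\,\sqrt{100000}} = \frac{1000}{1000} = 1 \\
\sigma(D_1,D_3) &= \frac{1\cdot 3 + 3\cdot 1}{\sqrt{10}\,\sqrt{10}} = \frac{6}{10} = 0.6 \\
d_2(D_1,D_2) &= \sqrt{99^2 + 297^2} = 99\sqrt{10} \approx 313.1 \\
d_2(D_1,D_3) &= \sqrt{2^2 + 2^2} = 2\sqrt{2} \approx 2.83 \\
d_1(D_1,D_2) &= 99 + 297 = 396 \\
d_1(D_1,D_3) &= 2 + 2 = 4
\end{align*}
```

D2 is D1 scaled by 100, so the cosine treats the two as identical while both distances put them far apart. D3 is close to D1 by either distance but points in a different direction, so its cosine score is lower.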

51
Readings