Search Engine Technology (1) http://www.cs.columbia.edu/radev/SET07.html

1
Search Engine Technology (1) http://www.cs.columbia.edu/radev/SET07.html
  • September 6, 2007
  • Prof. Dragomir R. Radev
  • radev_at_umich.edu

2
SET Fall 2007
  • Introduction

7
Examples of search engines
  • Conventional (library catalog). Search by
    keyword, title, author, etc.
  • Text-based (Lexis-Nexis, Google, Yahoo!). Search
    by keywords. Limited search using queries in
    natural language.
  • Multimedia (QBIC, WebSeek, SaFe). Search by visual
    appearance (shapes, colors, ...).
  • Question answering systems (Ask, NSIR,
    Answerbus). Search in (restricted) natural
    language.
  • Clustering systems (Vivisimo, Clusty)
  • Research systems (Lemur, Nutch)

8
What does it take to build a search engine?
  • Decide what to index
  • Collect it
  • Index it (efficiently)
  • Keep the index up to date
  • Provide user-friendly query facilities

9
What else?
  • Understand the structure of the web for efficient
    crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance

10
Goals of the course
  • Understand how to collect, store, index, analyze,
    search and present large quantities of
    unstructured text.
  • Understand the dynamics of the Web by building
    appropriate mathematical models.
  • Build working systems that assist users in
    finding useful information on the Web.
  • Understand and use third party software.

11
Course logistics
  • Thursdays 6-8 PM in 833 Mudd
  • Office hour: TBA
  • Web site: http://www.cs.columbia.edu/radev/SET07
  • Instructor: Dragomir Radev (PhD, Columbia-CS)
  • Email: radev_at_umich.edu (please do not send me
    mail at Columbia)
  • TA: TBA

12
Course outline
  • Classic document retrieval: storing, indexing,
    retrieval.
  • Web retrieval: crawling, query processing.
  • Text and web mining: classification, clustering.
  • Network analysis: random graph models,
    centrality, diameter and clustering coefficient.

13
Syllabus
  • Introduction.
  • Queries and Documents. Models of Information
    retrieval. The Boolean model. The Vector model.
  • Document preprocessing. Tokenization. Stemming.
    The Porter algorithm. Storing, indexing and
    searching text. Inverted indexes.
  • Word distributions. The Zipf distribution. The
    Benford distribution. Heaps' law. TF-IDF. Vector
    space similarity and ranking.
  • Retrieval evaluation. Precision and Recall.
    F-measure. Reference collections. The TREC
    conferences.
  • Automated indexing/labeling. Compression and
    coding. Optimal codes.
  • String matching. Approximate matching.
  • Query expansion. Relevance feedback.

14
Syllabus
  • Text classification. Naive Bayes. Feature
    selection. Decision trees.
  • Linear classifiers. k-nearest neighbors.
    Perceptron. Kernel methods. Maximum-margin
    classifiers. Support vector machines.
    Semi-supervised learning.
  • Lexical semantics and Wordnet.
  • Latent semantic indexing. Singular value
    decomposition.
  • Vector space clustering. k-means clustering. EM
    clustering.
  • Crawling the web. Webometrics. Measuring the size
    of the web. The Bow-tie model.
  • Hypertext retrieval. Web-based IR. Document
    closures. Focused crawling.

15
Syllabus
  • Random graph models. Properties of random graphs:
    clustering coefficient, betweenness, diameter,
    giant connected component, degree distribution.
  • Social network analysis. Small worlds and
    scale-free networks. Power law distributions.
    Centrality.
  • Graph-based methods. Harmonic functions. Random
    walks. PageRank.
  • Models of the web. Hubs and authorities.
    Bipartite graphs. HITS and SALSA.
  • Discovering communities. Spectral clustering.
  • Semi-supervised passage retrieval. Question
    answering.
  • Probabilistic models of IR. Language models.
    Burstiness. Self-triggerability.
  • Quick overview of topics such as: Collaborative
    filtering. Recommendation systems. Information
    extraction. Hidden Markov Models. Conditional
    Random Fields. User behavior.
  • Quick overview of topics such as: Adversarial IR.
    Spamming and anti-spamming methods. Text
    summarization.
  • Student presentations (between Dec 14 and Dec 21)
  • Additional topics (if time permits):
    Natural language processing. XML retrieval. Text
    tiling. Human behavior on the web.

16
Readings
  • Required: Information Retrieval by Manning,
    Schuetze, and Raghavan
    (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html),
    freely available, mirrored on August 14, 2007.
  • Optional: Modeling the Internet and the Web:
    Probabilistic Methods and Algorithms by Pierre
    Baldi, Paolo Frasconi, Padhraic Smyth, Wiley,
    2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
  • Papers from SIGIR, WWW and journals (to be
    announced in class).

17
Prerequisites
  • Linear algebra: vectors, matrices, and operations
    on them, determinants, eigenvectors.
  • Calculus: differentiation, finding extrema of
    functions.
  • Probabilities: random variables, discrete and
    continuous distributions, Bayes' theorem.
  • Programming: experience with at least one
    web-aware programming language such as Perl
    (highly recommended) or Java in a UNIX
    environment.
  • Required: CS account (check CS web site)

18
Course requirements
  • Three (mostly programming) assignments (40%)
  • Some of them will be in Perl. The rest can be
    done in any appropriate language.
  • Reading assignments (10%)
  • Final project (40%)
  • Students will present their final project in a
    poster session in class.
  • Class participation (10%)
  • No final exam.

19
Final project format
  • Research paper - using the SIGIR format. Students
    will be in charge of problem formulation,
    literature survey, hypothesis formulation,
    experimental design, implementation, and possibly
    submission to a conference like SIGIR or WWW.
  • Software system - develop a working system or
    API. Students will be responsible for identifying
    a niche problem, implementing it and deploying
    it, either on the Web or as an open-source
    downloadable tool. The system can be either
    stand-alone or an extension to an existing one.

20
Project ideas
  • Build a question answering system.
  • Build a language identification system.
  • Social network analysis from the Web.
  • Participate in the Netflix challenge.
  • Query log analysis.
  • Build models of Web evolution.
  • Information diffusion in blogs or web.
  • Author-topic models of web pages.
  • Using the web for machine translation.
  • Building evolving models of web documents.
  • News recommendation system.
  • Compress the text of Wikipedia (losslessly).
  • Spelling correction using query logs.
  • Automatic query expansion.

21
List of projects from last year
  • Document Closures for Indexing
  • Tibet - Table Structure Recognition Library
  • Ruby Blog Memetracker
  • Sentence decomposition for more accurate
    information retrieval
  • Extracting Social Networks from LiveJournal
  • Google Suggest Programming Project (Java Swing
    Client and Lucene Back-End)
  • Leveraging Social Networks for Organizing and
    Browsing Shared Photographs
  • Media Bias and the Political Blogosphere
  • Measuring Similarity between search queries
  • Extracting Social Networks and Information about
    the people within them from Text
  • LSI dependency trees

22
Available corpora
  • Enron email
  • CIA world factbook
  • DBLP: papers in CS
  • NNDB: information about people
  • BLOGS: collection of blogs
  • US congressional speeches
  • AOL queries
  • Netflix recommendations
  • IMDB
  • NIE: news articles
  • PUBMED: biomedical paper abstracts
  • Wikipedia
  • ACL Anthology: collection of papers in NLP/CL
  • DOTGOV: download of .GOV
  • biocreative: biomedical papers
  • WT100G: 100GB download of the web
  • Google n-grams
  • webfreq: frequency of words on the web
  • SMS corpus
  • Citeseer: CS papers
  • DMOZ: the open directory project
  • corpus of paraphrases
  • multilingual parallel parliamentary proceedings
  • textual entailment corpus
  • question answering corpus
  • summarization corpus
  • various text classification corpora
    (Reuters-21578, 20NG)
  • Peekaboom (from the game)

23
Related courses elsewhere
  • Stanford (Chris Manning, Prabhakar Raghavan, and
    Hinrich Schuetze)
  • Cornell (Jon Kleinberg)
  • CMU (Yiming Yang and Jamie Callan)
  • UMass (James Allan)
  • UTexas (Ray Mooney)
  • Illinois (Chengxiang Zhai)
  • Johns Hopkins (David Yarowsky)
  • For a long list of courses related to Search
    Engines, Natural Language Processing, Machine
    Learning look here:
    http://clair.si.umich.edu:8080/wordpress/?p=11

24
SET Fall 2007
  • 2. Models of Information retrieval. The Vector
    model. The Boolean model.

25
Sample queries (from Excite)
  • In what year did baseball become an offical
    sport?
  • play station codes . com
  • birth control and depression
  • government
  • "WorkAbility I"conference
  • kitchen appliances
  • where can I find a chines rosewood
  • tiger electronics
  • 58 Plymouth Fury
  • How does the character Seyavash in Ferdowsi's
    Shahnameh exhibit characteristics of a hero?
  • emeril Lagasse
  • Hubble
  • M.S Subalaksmi
  • running

26
Key Terms Used in IR
  • QUERY: a representation of what the user is
    looking for; can be a list of words or a phrase.
  • DOCUMENT: an information entity that the user
    wants to retrieve
  • COLLECTION: a set of documents
  • INDEX: a representation of information that makes
    querying easier
  • TERM: word or concept that appears in a document
    or a query
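
To make the relationship between these terms concrete, here is a minimal Python sketch of one common kind of index, an inverted index. The three-document collection, the whitespace tokenizer, and the function names `build_index` and `lookup` are all assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict


def build_index(collection):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def lookup(index, query):
    """Return ids of documents that contain every term of the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical collection, invented for this example.
collection = {
    "d1": "web search engines index documents",
    "d2": "library catalog search",
    "d3": "documents in a collection",
}
index = build_index(collection)

print(lookup(index, "search"))            # ['d1', 'd2']
print(lookup(index, "search documents"))  # ['d1']
```

The index is built once per collection, so answering a query only touches the postings of the query terms rather than scanning every document.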

27
Mappings and abstractions
[Figure: diagram with the labels Reality, Data, Information need, and Query. From Robert Korfhage's book.]

28
Documents
  • Not just printed paper
  • Can be records, pages, sites, images, people,
    movies
  • Document encoding (Unicode)
  • Document representation
  • Document preprocessing

29
Sample query sessions (from AOL)
  • toley spies grames → tolley spies games →
    totally spies games
  • tajmahal restaurant brooklyn ny → taj mahal
    restaurant brooklyn ny → taj mahal restaurant
    brooklyn ny 11209
  • do you love me like you say → do you love me like
    you say lyrics → do you love me like you say
    lyrics marvin gaye

30
Characteristics of user queries
  • Sessions: users revisit their queries.
  • Very short queries: typically 2 words long.
  • A large number of typos.
  • A small number of popular queries. A long tail of
    infrequent ones.
  • Almost no use of advanced query operators, with
    the exception of double quotes.

31
Queries as documents
  • Advantages: mathematically easier to manage
  • Problems: different lengths, syntactic
    differences, repetitions of words (or lack
    thereof)

32
Document representations
  • Term-document matrix (m x n)
  • Document-document matrix (n x n)
  • Typical example in a medium-sized collection:
    3,000,000 documents (n) with 50,000 terms (m)
  • Typical example on the Web: n = 30,000,000,000,
    m = 1,000,000
  • Boolean vs. integer-valued matrices
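
The matrices listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the three toy documents, the whitespace tokenizer, and the function names are assumptions, and the document-document matrix is computed here as plain dot products between document columns, which is one possible choice among several.

```python
def term_document_matrix(docs, boolean=False):
    """Return (terms, matrix); matrix[i][j] describes term i in document j.

    With boolean=False the entries are term counts (integer-valued);
    with boolean=True they are 0/1 presence flags.
    """
    tokenized = [d.lower().split() for d in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    matrix = []
    for term in terms:
        row = [doc.count(term) for doc in tokenized]
        if boolean:
            row = [1 if c else 0 for c in row]
        matrix.append(row)
    return terms, matrix  # m rows (terms) by n columns (documents)


def document_document_matrix(matrix):
    """Return the n x n matrix of dot products between document columns."""
    n = len(matrix[0]) if matrix else 0
    return [[sum(row[a] * row[b] for row in matrix) for b in range(n)]
            for a in range(n)]


# Hypothetical documents, invented for this example.
docs = ["search search engine", "search index", "index index index"]
terms, counts = term_document_matrix(docs)
_, flags = term_document_matrix(docs, boolean=True)
dd = document_document_matrix(counts)

print(terms)   # ['engine', 'index', 'search']
print(counts)  # [[1, 0, 0], [0, 1, 3], [2, 1, 0]]
print(flags)   # [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
print(dd)      # [[5, 2, 0], [2, 2, 3], [0, 3, 9]]
```

At the medium-sized figures quoted above (3,000,000 documents by 50,000 terms), a dense matrix would have 1.5 × 10^11 cells, which is why practical systems usually store only the non-zero entries.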

33
Major IR models
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing

34
The Boolean model
[Figure: Venn diagrams showing documents D1 and D2 and the terms w, x, y, z]

35
Boolean queries
  • Operators: AND, OR, NOT, parentheses
  • Examples:
  • CLEVELAND AND NOT OHIO
  • (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
  • Ambiguous uses of AND and OR in human language
  • Exclusive vs. inclusive OR
  • Restrictive operator: AND or OR?

36
Canonical forms of queries
  • De Morgan's Laws

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
  • Normal forms
  • Conjunctive normal form (CNF)
  • Disjunctive normal form (DNF)
  • Reference librarians prefer CNF - why?
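
De Morgan's laws can be checked mechanically. The Python sketch below verifies them over all truth assignments and then over sets of document ids, where NOT is read as the complement within the collection. The collection and the sets `a` and `b` are hypothetical values chosen for illustration.

```python
from itertools import product


def de_morgan_holds():
    """Check both laws for every combination of truth values of A and B."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True


print(de_morgan_holds())  # True

# The same laws on result sets: NOT X is the complement of X
# within the collection. All ids below are made up.
collection = {1, 2, 3, 4, 5}
a = {1, 2, 3}  # documents matching term A
b = {3, 4}     # documents matching term B

lhs_and = collection - (a & b)
rhs_and = (collection - a) | (collection - b)
lhs_or = collection - (a | b)
rhs_or = (collection - a) & (collection - b)

print(lhs_and == rhs_and)  # True
print(lhs_or == rhs_or)    # True
```

Rewrites of this kind are what allow a query containing negated groups to be pushed into a normal form such as CNF or DNF.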

37
Evaluating Boolean queries
  • Incidence vectors
  • CLEVELAND: 1100010
  • OHIO: 1000111
  • Examples
  • CLEVELAND AND OHIO
  • CLEVELAND AND NOT OHIO
  • CLEVELAND OR OHIO
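
The three example queries can be evaluated with bitwise operations on the two incidence vectors. The sketch below assumes that each bit position stands for one of seven documents; the slide does not say which document each position refers to, so only the bit patterns are shown.

```python
WIDTH = 7                 # number of documents (bits per incidence vector)
MASK = (1 << WIDTH) - 1   # keeps NOT results within WIDTH bits

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

both = cleveland & ohio                      # CLEVELAND AND OHIO
only_cleveland = cleveland & (~ohio & MASK)  # CLEVELAND AND NOT OHIO
either = cleveland | ohio                    # CLEVELAND OR OHIO


def bits(value):
    """Format a result as a WIDTH-character incidence vector."""
    return format(value, f"0{WIDTH}b")


print(bits(both))            # 1000010
print(bits(only_cleveland))  # 0100000
print(bits(either))          # 1100111
```

Masking after `~` matters in Python because integers are unbounded, so an unmasked NOT would produce a negative number rather than a 7-bit vector.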

38
Exercise
  • D1: computer information retrieval
  • D2: computer retrieval
  • D3: information
  • D4: computer information
  • Q1: information AND retrieval
  • Q2: information AND NOT computer

39
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
40
How to deal with?
  • Multi-word phrases?
  • Document ranking?

41
The Vector model
[Figure: documents Doc 1, Doc 2, and Doc 3 shown in a space with axes Term 1, Term 2, and Term 3]

42
Vector queries
  • Each document is represented as a vector
  • Inefficient representation
  • Dimensional compatibility

43
The matching process
  • Document space
  • Matching is done between a document and a query
    (or between two documents)
  • Distance vs. similarity measures.
  • Euclidean distance, Manhattan distance, Word
    overlap, Jaccard coefficient, etc.

44
Miscellaneous similarity measures
  • The Cosine measure (normalized dot product)

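$$\sigma(D,Q) = \frac{\sum_i (d_i \times q_i)}{\sqrt{\sum_i (d_i)^2}\,\sqrt{\sum_i (q_i)^2}} = \frac{|X \cap Y|}{\sqrt{|X|}\,\sqrt{|Y|}}$$

  • The Jaccard coefficient

$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$
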
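
The measures named on the last two slides can be sketched in Python as follows. The vectors `d` and `q` are hypothetical and are not the vectors of the exercise below; the Jaccard function treats a vector as the set of positions with non-zero weight, which is an assumption about how X and Y are derived from D and Q.

```python
import math


def cosine(d, q):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


def jaccard(d, q):
    """Size of the intersection over size of the union of the term sets."""
    x = {i for i, v in enumerate(d) if v}
    y = {i for i, v in enumerate(q) if v}
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def euclidean(d, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, q)))


def manhattan(d, q):
    return sum(abs(x - y) for x, y in zip(d, q))


# Hypothetical binary term vectors over four terms.
d = [1, 1, 0, 1]
q = [1, 0, 0, 1]

print(round(cosine(d, q), 3))   # 0.816
print(round(jaccard(d, q), 3))  # 0.667
print(euclidean(d, q))          # 1.0
print(manhattan(d, q))          # 1
```

Cosine and Jaccard are similarities (larger means closer), while Euclidean and Manhattan are distances (smaller means closer), so their rankings have to be read in opposite directions.
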
45
Exercise
  • Compute the cosine scores σ(D1, D2) and σ(D1, D3)
    for the documents D1, D2 and D3.
  • Compute the corresponding Euclidean distances,
    Manhattan distances, and Jaccard coefficients.

46
Readings
  • For September 13: MRS1, MRS2, MRS5 (Zipf)
  • For September 20: MRS7, MRS8