Title: EECS 395/495: Web Information Retrieval and Extraction
1. EECS 395/495 Web Information Retrieval and Extraction
2. Outline
- Introductions
- Course goals and logistics
- Why take this class?
- What is Web Search?
- What will it be in five years? Ten years?
3. Introductions
- Professor Doug Downey
- TA Junsong Yuan
4. Goals
- How does Web search work?
  - First half of course
- What's the future of Web search?
  - Second half of course
5. Logistics
- First half of the class: lectures
- Second half: presentations of research papers
  - Groups lead discussions
  - Each individual presents a handful of slides
  - See Web page and sign up (e-mail TA) this week
6. Logistics
- Grading
  - Participation (30%)
    - During lectures/discussions (10%)
    - Leading a discussion (15%)
    - Search feature pitch (5%)
  - Projects (70%)
    - Project grade, based on individual contribution (60%)
    - Review of another team's project (10%)
7. Projects
- Groups of 2-3 (not necessarily discussion groups)
- Examples
  - Read 3-5 recent research papers and summarize the state of the art in some area of IR/IE
  - Implement an IR/IE system and report on the results
  - Answer theoretical questions
  - Etc.
8. Specific Examples
- Systems
  - Automatically place blogs on liberal/conservative continuum
  - Create a search knob for specifying reading level
  - Execute relevant background searches as I write an e-mail
- Theoretical questions
  - Read a paper on search ad auctions or PageRank and attempt to extend the results
- Summary of Research
  - Read 5 papers on automated question answering; summarize the field and suggest future directions
9. Project Milestones
- April 8: Proposal (1 page)
  - Meetings with me to finalize April 9/10
- lots of progress
- May 5: Report of preliminary results (3 pages)
- May 7: Review group provides feedback (1 page)
- June 2: Final Report (4 pages)
- June 4/June 8 (finals week): Final Presentations (10 min + 5 min for Q&A)
10. Search Feature Pitch
- This Thursday (!)
- Individually deliver a 2.5-minute pitch of a new search engine feature
  - Plausible, not necessarily possible (see example later)
- Send me your ppt/pdf slides by Thursday at 9 AM
- Class discusses for one minute while next presenter sets up
11. Outline
- Introductions
- Course goals and logistics
- Why take this class?
- What is Web Search?
- What will it be in five years? Ten years?
12. Who cares about search?
- "The most important application for the foreseeable future...is search." - Steve Ballmer, CEO of Microsoft (Financial Times)
  http://news.cnet.com/8301-10784_3-9973650-7.html
- Why?
  - Search's utility scales as the Web scales
  - People use it all the time
  - Control
  - Profit
13. Why should you care about search?
- Opportunity
- Fascinating and important enabling technologies
  - Scaling
  - Machine Learning/Data Mining
  - Graph-based algorithms
  - Language Understanding
  - Auction theory
  - User Interfaces
14. Graph-based Algorithms
- Example: PageRank
  - Google's original claim to fame
  - Idea: quality of p is proportional to the aggregate quality of the pages linking to p
- PageRank(p) = probability that a random surfer lands on p
  - Pick a starting page at random
  - Follow links uniformly at random
  - Every now and then, jump to a random page
  - PageRank(p) = proportion of visits to p
15. PageRank example
[Figure: link graph over pages A-F; 15% probability of a random jump]
17. How to compute PageRank
- Simulate a random surfer?
  - On 20 billion pages and 400 billion hyperlinks
- It can be done
  - You'll learn how
- What about
  - Personalized PageRank? Link spam?
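
The random-surfer description can be sketched as power iteration on a tiny graph instead of a literal simulation. This is a minimal sketch under stated assumptions, not how a production engine computes PageRank at Web scale: the function name `pagerank`, the six-page link graph, the iteration count, and the treatment of pages with no outgoing links (the surfer jumps to a random page) are all illustrative choices. The 15% jump probability is the value from the example slide.

```python
def pagerank(links, jump=0.15, iterations=100):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    jump:  probability that the surfer jumps to a uniformly random page.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    # The surfer starts at a page chosen uniformly at random.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the random jumps.
        new_rank = {p: jump / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Follow one of p's links uniformly at random.
                share = (1.0 - jump) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links sends the surfer
                # to a uniformly random page.
                share = (1.0 - jump) * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical link graph over pages A-F (not the graph from the slide).
example_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C", "D"],
    "F": [],
}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Each pass redistributes the current rank along the links, so the values converge toward the long-run proportion of visits without tracking an individual surfer.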
18. What do you need to take this class?
- EECS 311
  - Basic understanding of algorithms and data structures
- Basics of linear algebra and probability theory
- Tolerance for non-linearity
- Willingness to participate
19. Outline
- Introductions
- Course goals and logistics
- Why take this class?
- What is Web Search?
- What will it be in five years? Ten years?
20. Web Search today
- Performs an easy task
- Extremely quickly
- At massive scale
- Relatively well
21. Easy?
- Belief: Web search engines have to understand my query and find a needle in a haystack of 15 billion documents!
- Reality: most search queries are
  - short (avg. 2.5 words, 2005)
  - satisfied by pages from a small subset of the Web
    - Millions rather than billions (Mei et al., 2008)
- => More like finding a pencil in a haystack!
22. Extremely Quickly
- Results returned in < 1 sec
  - For almost any query
  - For any engine
- It was not always thus!
  - 3-4 seconds in early days (Chu & Rosenthal, 1996; Garratt et al., 2001)
- How? Inverted indices, enormous data centers, clever algorithms
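
To make "inverted indices" concrete, here is a minimal sketch of the idea: map each term to the set of documents that contain it, then answer a query by intersecting those sets. The function names, the whitespace tokenizer, and the three toy documents are assumptions for illustration; real engines add compression, ranking, and distribution across many machines.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    # Intersect the shortest posting list first to keep intermediate sets small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Toy corpus (hypothetical documents).
docs = {
    1: "florida gators football",
    2: "gators media guide",
    3: "football media guide 2002",
}
index = build_inverted_index(docs)
print(search(index, "media guide"))
```

The lookup cost depends on the length of the posting lists for the query terms rather than on the size of the whole collection, which is one reason short queries can be answered quickly.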
23. Users are quick too
(Downey, Dumais, & Horvitz, 2007)
24. At Massive Scale
- (in millions)
- How? Inverted indices, enormous data centers, clever algorithms
http://blog.searchenginewatch.com/comscoresearchsharefeb2009_0309.jpg
25. Relatively Well
26. Note: for images missing from this online version, see links at bottom of page
http://www.seoresearcher.com/distribution-of-clicks-on-googles-serps-and-eye-tracking-analysis.htm
27. Web Search Engines today
- For the most part
  - Perform an easy task
  - Extremely quickly
  - At massive scale
  - Relatively well
- The future of Web search is in more difficult tasks
28. Trend toward more difficult queries
- Search queries are getting longer.
http://www.readwriteweb.com/archives/hitwise_search_queries_are_getting_longer.php
29. Things you can't do with search today
- Query by description rather than content
  - Humorous anecdotes about Perry Farrell
- Extracting and synthesizing over multiple pages
  - Nanotechnology companies hiring on the West Coast
  - Substances the FDA has banned
- Organizing bodies of documents
  - Show me the most compelling cases for and against the recent stimulus package
  - Who says drinking from aluminum soda cans increases my Alzheimer's risk? Should I believe them?
30. Image Search for famous pink building
31. Search Feature Pitch
- Guidelines
  - Introduce yourself (name, major, degree, year)
  - What's the problem
  - What's your solution
  - What makes it feasible
- 2 1/2 minutes isn't a lot of time!!
32. Mechanism for over-specified queries
- Florida gators football depth chart 2002
  - 0 results
- Florida gators football depth chart 2002
  - Many results -- none correct in top 10
- After protracted manual reformulation process
  - 2002 gator football media guide
  - 8 results, top several correct
33. Proposed Solution
- User submits over-specified query
  - one with zero hits, e.g. Florida gators football depth chart 2002
- Search engine
  - Tries moving quotes, substituting phrases, etc.
    - Florida gators football depth chart 2002
    - Florida gators depth chart 2002 pigskin
  - After a few minutes, returns summary of results
34. Why plausible?
- Substitutions can be obtained from thesauri or statistical co-occurrence, e.g.
  - P(depth chart | football media guide) is large => try substituting football media guide for depth chart in query
- Given a few minutes, we can scalably execute several hundred candidate queries
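
The substitute-and-retry idea in this pitch can be sketched as follows. This is only an illustration of the candidate-generation step, not the pitched system: the function names, the way the query is split into phrases, the substitution table, and the `toy_hits` lookup that stands in for a real search backend are all hypothetical, and the hit count in it is a made-up placeholder.

```python
from itertools import product


def candidate_queries(phrases, substitutions):
    """List reformulations: each phrase is kept or swapped for a substitute."""
    options = [[p] + substitutions.get(p, []) for p in phrases]
    seen = set()
    candidates = []
    for combo in product(*options):
        query = " ".join(combo)
        if query not in seen:
            seen.add(query)
            candidates.append(query)
    return candidates


def reformulate(phrases, substitutions, count_hits, limit=200):
    """Run up to `limit` candidate queries and keep those with any hits."""
    results = []
    for query in candidate_queries(phrases, substitutions)[:limit]:
        hits = count_hits(query)
        if hits > 0:
            results.append((query, hits))
    return results


# Hypothetical phrase segmentation and substitution table.
phrases = ["florida gators", "football", "depth chart", "2002"]
substitutions = {"depth chart": ["media guide"], "football": ["pigskin"]}

# Stand-in for a search backend; the hit count is a placeholder, not real data.
toy_hits = {"florida gators football media guide 2002": 8}
summary = reformulate(phrases, substitutions, lambda q: toy_hits.get(q, 0))
print(summary)
```

The number of candidates grows multiplicatively with the number of substitutable phrases, which is why the sketch caps how many queries it executes.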
35. Reminder
- Your search engine feature pitch
  - Presented this Thursday in class
  - Send ppt or pdf to Junsong and me by 9 AM Thurs
  - See course Web page
- Also
  - Start forming project groups
  - Sign up for team meeting with me next Thurs/Fri
  - Look for discussion papers and dates later today