Inside Internet Search Engines: Spidering and Indexing
1
Inside Internet Search Engines: Spidering and Indexing
  • Jan Pedersen and William Chang

2
Basic Architectures
[Diagram: a Spider crawls the Web into an Index; search engine (SE) front ends serve the Index to Browsers and record a query Log (20M queries/day). Labels note the challenges: Spam, Freshness, 24x7 operation, Quality results, and index size (800M pages?).]
3
Basic Algorithm
  • (1) Pick Url from pending queue and fetch
  • (2) Parse document and extract hrefs
  • (3) Place unvisited Urls on pending queue
  • (4) Index document
  • (5) Goto (1)
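The five steps above can be sketched in Python (illustrative only; `fetch`, `parse_hrefs`, and `index` are hypothetical caller-supplied functions, not from the slides):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, parse_hrefs, index, max_pages=100):
    """Minimal spider loop following the five steps above.

    fetch(url) -> document text; parse_hrefs(doc) -> iterable of hrefs;
    index(url, doc) records the document. All three are supplied by the
    caller -- their names are illustrative.
    """
    pending = deque(seed_urls)            # the pending queue
    visited = set()
    while pending and len(visited) < max_pages:
        url = pending.popleft()           # (1) pick Url and fetch
        if url in visited:
            continue
        visited.add(url)
        doc = fetch(url)
        for href in parse_hrefs(doc):     # (2) parse, extract hrefs
            absolute = urljoin(url, href)
            if absolute not in visited:
                pending.append(absolute)  # (3) queue unvisited Urls
        index(url, doc)                   # (4) index document
    return visited                        # loop back = (5) goto (1)
```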

4
Issues
  • Queue maintenance determines behavior
  • Depth vs breadth
  • Spidering can be distributed
  • but queues must be shared
  • Urls must be revisited
  • Status tracked in a Database
  • Revisit rate determines freshness
  • SEs typically revisit every url monthly
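Depth vs breadth comes down to how the pending queue is serviced; in this sketch (the `links` graph is a made-up example) the same queue yields breadth-first when used as a FIFO and depth-first when used as a LIFO:

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
links = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}

def crawl_order(start, breadth=True):
    """Return the visit order; FIFO gives breadth-first, LIFO depth-first."""
    queue, seen, order = deque([start]), {start}, []
    while queue:
        url = queue.popleft() if breadth else queue.pop()
        order.append(url)
        for href in links[url]:
            if href not in seen:
                seen.add(href)
                queue.append(href)
    return order
```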

5
Deduping
  • Many urls point to the same pages
  • DNS aliasing
  • Many pages are identical
  • Site mirroring
  • How big is my index, really?
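One common approach, sketched here under simple assumptions, is to canonicalize urls before queueing (so DNS aliases collapse) and to fingerprint page content (so mirrored pages dedupe to one entry). The normalization rules shown are illustrative, not a complete canonicalizer:

```python
import hashlib
from urllib.parse import urlsplit

def canonical(url):
    """Normalize a url so trivial aliases collapse: lowercase the
    host, drop the port and the fragment (illustrative rules only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return f"{parts.scheme}://{host}{parts.path or '/'}"

def content_key(doc):
    """Fingerprint page bodies; identical mirrors hash to one key."""
    return hashlib.sha1(doc.encode()).hexdigest()
```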

6
Smart Spidering
  • Revisit rate based on modification history
  • Rapidly changing documents visited more often
  • Revisit queues divided by priority
  • Acceptance criteria based on quality
  • Only index quality documents
  • Determined algorithmically
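One way to realize history-based revisiting is a priority queue keyed on next-visit time, shortening the interval for pages that changed and lengthening it for pages that did not. The halving/doubling factors and bounds below are illustrative, not from the slides:

```python
import heapq

class RevisitScheduler:
    """Sketch of modification-history-driven revisiting: a changed
    page gets its interval halved, an unchanged one doubled."""

    def __init__(self, min_interval=1, max_interval=30):
        self.min, self.max = min_interval, max_interval
        self.heap = []  # entries: (next_visit_day, url, interval)

    def add(self, url, day=0, interval=7):
        heapq.heappush(self.heap, (day + interval, url, interval))

    def next_due(self):
        """Pop the url whose revisit is due soonest."""
        return heapq.heappop(self.heap)

    def report(self, url, day, interval, changed):
        """Reschedule after a visit, adapting to observed change."""
        new = max(self.min, interval // 2) if changed else min(self.max, interval * 2)
        heapq.heappush(self.heap, (day + new, url, new))
```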

7
Spider Equilibrium
  • Url queues do not increase in size
  • New documents are discovered and indexed
  • Spider keeps up with desired revisit rate
  • Index drifts upward in size
  • At equilibrium index is Everyday Fresh
  • As if every page were revisited every day
  • Requires 10 daily revisit rates, on average

8
Computational Constraints
  • Equilibrium requires increasing resources
  • Yet total disk space is a system constraint
  • Strategies for dealing with space constraints
  • Simple refresh only revisit known urls
  • Prune urls via stricter acceptance criteria
  • Buy more disk

9
Special Collections
  • Newswire
  • Newsgroups
  • Specialized services (Deja)
  • Information extraction
  • Shopping catalog
  • Events; recipes, etc.

10
The Hidden Web
  • Non-indexable content
  • Behind passwords, firewalls
  • Dynamic content
  • Often searchable through local interface
  • Network of distributed search resources
  • How to access?
  • Ask Jeeves!

11
Spam
  • Manipulation of content to affect ranking
  • Bogus meta tags
  • Hidden text
  • Jump pages tuned for each search engine
  • Add Url is a spammer's tool
  • 99% of submissions are spam
  • It's an arms race

12
Representation
  • For precision, indices must support phrases
  • Phrases make best use of short queries
  • The web is precision biased
  • Document location also important
  • Title vs summary vs body
  • Meta tags offer a special challenge
  • To index or not?
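Phrase support usually means storing term positions in the postings. A toy positional index in Python (an illustrative layout, not any engine's actual on-disk format):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Positional postings: term -> {doc_id: [positions]}. Positions
    let the engine verify that query terms are adjacent (a phrase)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return doc_ids where the phrase's terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for p in starts:
            # check each following term at the next position over
            if all(doc_id in index[t] and p + i in index[t][doc_id]
                   for i, t in enumerate(terms[1:], 1)):
                hits.add(doc_id)
                break
    return hits
```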

13
Indexing Tricks
  • Inverted indices are non-incremental
  • Design for compactness and high-speed access
  • Updated through merge with new indices
  • Indices can be huge
  • Minimize copying
  • Use RAID for speed and reliability
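The merge-update idea can be illustrated with toy dict-based indices; real engines merge compressed on-disk postings runs, but the fresh-crawl-supersedes logic is the same in spirit:

```python
def merge_indices(old, new):
    """Merge-update: a small fresh index is folded into the big one.
    For a url present in both, the fresh postings win. Indices here
    are simple {term: {url: positions}} dicts for illustration."""
    merged = {}
    for term in sorted(set(old) | set(new)):
        postings = dict(old.get(term, {}))
        postings.update(new.get(term, {}))  # fresh crawl supersedes
        merged[term] = postings
    return merged
```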

14
Truncation
  • Search Engines do not store all postings
  • How could they?
  • Tuned to return 10 good hits quickly
  • Boolean queries evaluated conservatively
  • Negation is a particular problem
  • Some measurement methods depend on strong queries;
    how accurate can they be?
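Returning "10 good hits quickly" typically means early termination over quality-ordered postings rather than exhaustive evaluation. A minimal sketch, assuming documents are pre-ranked by some static quality score:

```python
def first_k_hits(docs_by_quality, required_terms, k=10):
    """Early termination: docs arrive pre-ranked (e.g. by quality);
    stop as soon as k matches are found instead of scanning all
    postings. Negation and weak queries defeat this shortcut."""
    hits = []
    for doc_id, terms in docs_by_quality:
        if required_terms <= terms:     # doc contains every query term
            hits.append(doc_id)
            if len(hits) == k:
                break                   # truncate: good enough, stop
    return hits
```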

15
The Role of NLP
  • Many Search Engines do not stem
  • Precision bias suggests conservative term
    treatment
  • What about non-English documents?
  • N-grams are popular for Chinese
  • Language ID anyone?
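Character bigrams sidestep word segmentation for Chinese, where words are not space-delimited; a one-line sketch:

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams; bigrams (n=2) are a popular
    indexing unit for Chinese text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```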