1
Search and Discovery: Searching the Web
2
Stages of a transaction
  • Discovery
  • Find what you're interested in
  • Locate sellers
  • Locate buyers
  • Compare products
  • Negotiation
  • Exchange

3
Discovery
  • Encompasses
  • Search engines
  • Recommender systems
  • Price comparison/shopping agents
  • Description languages
  • Data sources
  • Generic sources: portals, web directories
  • Domain-specific sources: catalogs, guides, etc.
  • Advertising

4
Discovery
  • More than just finding a resource
  • Need to be able to estimate value, likelihood of
    successful negotiation
  • An evaluative infrastructure is required
  • Least formalized of e-commerce subareas.
  • Unlikely to have a general-purpose solution soon
  • Too complex

5
A Brief History of the Web
  • Prehistory
  • Hypertext as an idea has been around since the
    40s.
  • Vannevar Bush's Memex
  • Engelbart's work in the '60s
  • 1987: HyperCard
  • Graphical tool allowing users to create
    hyperlinked documents.
  • Late '80s/early '90s: WAIS, Gopher

6
A Brief History of the Web
  • 1989/90: Tim Berners-Lee proposes the WWW at CERN
  • A new global information retrieval system
  • Develops HTML, a simple markup language
  • 1993: Mosaic developed at NCSA
  • Marc Andreessen then founds Netscape
  • 1993/94: NCSA httpd released
  • Open-source web server, supported CGI
  • Precursor to Apache

7
A Brief History of the Web
  • 1994: Banner ads appear on HotWired
  • Beginning of the commercial web
  • 1994: Yahoo founded
  • Appearance of the portal and the search engine
  • 1995: NSF backbone privatized
  • AT&T, Sprint, etc. take over traffic
  • Network Solutions given a monopoly on domain
    names
  • 1995: Microsoft releases Internet Explorer
  • In 7 years, Netscape goes from 100% market share
    to 20% (2001).

8
A Brief History of the Web
  • 1995: AltaVista started
  • Full-text Web search
  • 1995: Andreessen becomes the first WWW billionaire
  • 1995: Sun introduces Java
  • Able to ship code and text across networks
  • 1995: eBay founded
  • First online auction
  • 1995-98: Explosive growth
  • Many new formats, applications, companies
  • 1998: Akamai founded (web caching)

9
A Brief History of the Web
  • 1998: ICANN governs names and addresses
  • 1998: MP3 format popularized
  • WinAmp released
  • Small enough to make audio distribution practical
  • 1998: Google founded.
  • 2000: Napster appears
  • Beginnings of peer-to-peer technology, file
    sharing
  • 2000 (ish): End of the boom
  • Consolidation, reduction in growth

10
Lessons from Radio
  • Radio was popularized in the 1920s
  • Originally intended as a one-to-one messaging
    system.
  • Fee-for-use pay structure.
  • 1922: Explosive growth begins
  • RCA's revenues from sales of receivers doubled
    each year
  • Broadcast model becomes prevalent
  • Thousands of broadcasters emerge

11
Lessons From Radio
  • 1922-1924: Transition
  • How to make money broadcasting?
  • Support sale of receivers
  • Goodwill (sponsors)
  • Public good supported as a non-profit
  • Advertising
  • Tube tax/set tax (a la BBC)
  • By 1924, stations are failing as quickly as they
    start.

12
Lessons From Radio
  • Affordable content driven by audience size
  • Rich-get-richer for large stations
  • 1926: RCA launches NBC
  • First nationwide broadcast
  • Creates the network system
  • National content, local broadcasting
  • Advertising the dominant revenue generator
  • WWW questions
  • Who will be NBC?
  • What will the revenue model be?
  • Advertising? Competition with TV, radio for this
    revenue.
  • Micropayments? Subscriptions? Content
    aggregation?

13
Searching the Web
  • Web growth estimated at 1000% in the late '90s.
  • Can search engines keep up with this growth?
  • How to deal with the dynamic nature of the web?
  • Page contents change
  • Pages appear, disappear, move
  • Link structure changes

14
Search Engines
  • Most common form of discovery
  • Crawl the web to collect pages
  • Stored and indexed for easy retrieval
  • Query languages are simple
  • Goals
  • Fast retrieval (Google gets 150 million queries
    per day)
  • Accurate (no dead links)
  • Precise (pages match users' needs)

15
Terminology
  • Outward link
  • Object that a page links to
  • Outdegree: number of outward links
  • Inward link
  • Pages that link to an object
  • Indegree: number of inward links
  • Path
  • Series of outward links from A to B

16
The Web as a Directed Graph
  • We can represent the web as a directed graph
    (see the sketch after this slide).
  • Sites are nodes
  • Links are edges.
  • Outward link
  • Object that a page links to
  • Inward link
  • Pages that link to an object

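A minimal Python sketch of this representation (the five pages and their links
are invented purely for illustration): each page maps to the set of pages it
links to, so outdegree and indegree fall out directly.

    # Hypothetical web graph as an adjacency list: page -> pages it links to
    web = {
        "A": {"B", "C"},
        "B": {"C"},
        "C": {"A"},
        "D": {"C"},
        "E": set(),          # a page with no outward links
    }

    outdegree = {page: len(links) for page, links in web.items()}

    indegree = {page: 0 for page in web}
    for links in web.values():
        for target in links:
            indegree[target] += 1

    print(outdegree)   # {'A': 2, 'B': 1, 'C': 1, 'D': 1, 'E': 0}
    print(indegree)    # {'A': 1, 'B': 1, 'C': 3, 'D': 0, 'E': 0}
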
17
The Web as a Directed Graph
18
Adjacency Matrix
  • We can also represent the Web as a very large
    adjacency matrix.
  • The eigenvectors of this matrix illustrate how
    clustered the Web is (see the sketch below)
  • Distribution of in-degree and out-degree
  • Connectedness
  • Some ranking algorithms (HITS) use this measure.

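A rough sketch of the adjacency-matrix view, assuming NumPy is available; the
4-page matrix is made up, and the principal eigenvector is found by simple
power iteration, which is the flavour of computation HITS-style ranking
relies on.

    import numpy as np

    # Adjacency matrix for a toy 4-page web: M[i, j] = 1 if page i links to page j
    M = np.array([
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 0],
    ], dtype=float)

    out_degree = M.sum(axis=1)   # row sums
    in_degree = M.sum(axis=0)    # column sums

    # Principal eigenvector of M^T M, found by power iteration; HITS uses this
    # kind of quantity for authority scores.
    v = np.ones(M.shape[0])
    for _ in range(100):
        v = M.T @ (M @ v)
        v /= np.linalg.norm(v)

    print(in_degree, out_degree)
    print(v)   # the heavily linked page dominates
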
19
Web structure
  • Web can be broken into four areas
    (Kleinberg/Lawrence)
  • Core: path between any two pages
  • Upstream: can reach the core, but no path from the
    core.
  • Downstream: can be reached from the core, but cannot
    reach the core.
  • Tendrils/islands: disconnected from the core.
  • Areas (allegedly) have roughly equal size.

20
Coverage
  • Search engines claim they index a large fraction
    of the web.
  • How to verify this?
  • Run queries on many engines and compare number of
    hits.
  • May return irrelevant documents
  • Documents may no longer exist
  • Documents may have changed

21
Coverage
  • NEC (1998): estimated the size of the web and the
    coverage of the major search engines.
  • Query each engine, retrieve and compare all
    results (only exact matches).
  • Coverage estimates
  • HotBot 57%, AltaVista 46%
  • NorthernLight 33%, Excite 23%
  • Infoseek 16%, Lycos 4%

22
Estimating the size of the indexable web
  • Overlap in coverage was used to estimate size.

(Figure: two overlapping circles, engine A's index and engine B's index, with
overlap U.)
If the two indexes sample the Web roughly independently, U/B serves as an
estimate of A/N, where N is the size of the Web, so N is approximately
(A x B) / U (see the worked example below). 1998 AltaVista/HotBot
estimate: 320 million pages.
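The arithmetic behind that estimate, with counts invented purely to show the
calculation (not the actual 1998 figures):

    # Hypothetical counts: pages indexed by engines A and B, and their overlap U
    size_A = 140_000_000
    size_B = 110_000_000
    overlap = 48_000_000      # pages found in both indexes

    # If the indexes sample the Web independently, U/B estimates A/N,
    # so N is approximately A * B / U.
    N_estimate = size_A * size_B / overlap
    print(f"Estimated size of the Web: {N_estimate:,.0f} pages")
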
23
Using size to refine coverage estimates (1997)
  • This value can then be used to determine a
    coverage estimate for each engine.
  • For each pair, solve for N.
  • Assume real N is largest found.
  • Updated: HotBot 34%, AltaVista 28%
  • NorthernLight 20%, Excite 14%
  • Infoseek 10%, Lycos 3%

24
Updates (1999)
  • Web growth ahead of indexing
  • No search engine covers more than 16% of the Web.
  • Union of all engines: 50% coverage
  • Estimated size: 800 million pages
  • Search engines more likely to link to authorities
  • More likely to link to US, commercial sites.

25
Updates (12/2001)
  • Self-reported number of pages indexed
  • Google: 2 billion (3 billion today)
  • FAST (AllTheWeb.com): 625 million
  • (claimed 2.1 billion in 2002)
  • AltaVista: 550 million
  • Inktomi: 500 million
  • NorthernLight: 390 million

26
Indexing the web
  • Spiders are used to crawl the web and collect
    pages.
  • A page is downloaded and its outward links are
    found.
  • Each outward link is then downloaded (see the
    crawler sketch after this slide).
  • Exceptions
  • Links from CGI interfaces
  • Robot Exclusion Standard

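A bare-bones sketch of that crawling loop, using only Python's standard
library; a real spider would also honor robots.txt (the Robot Exclusion
Standard), throttle its requests, and skip CGI-generated links. The seed URL
is just an example.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href targets from anchor tags, tolerating malformed HTML."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=50):
        frontier, seen, pages = [seed], {seed}, {}
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue                      # dead or unreachable link
            pages[url] = html                 # stored for indexing
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages

    # pages = crawl("https://www.cs.usfca.edu")
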
27
Indexing the Web
  • Stop words stripped from page
  • Forward index created
  • Bundles words
  • Maps words to documents.
  • Can use TF-IDF to map only significant keywords
  • Term Frequency × Inverse Document Frequency (see
    the sketch below)

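A small sketch of that weighting, written from the standard TF-IDF definition
(term frequency times the log of inverse document frequency); the three
documents are invented.

    import math
    from collections import Counter

    docs = {
        "d1": "search engines crawl the web",
        "d2": "the web is a directed graph",
        "d3": "search queries hit the index",
    }
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    N = len(tokenized)

    def tf_idf(term, doc_id):
        words = tokenized[doc_id]
        tf = Counter(words)[term] / len(words)
        df = sum(1 for ws in tokenized.values() if term in ws)
        idf = math.log(N / df) if df else 0.0
        return tf * idf

    print(tf_idf("search", "d1"))   # > 0: distinguishes d1 from most documents
    print(tf_idf("the", "d1"))      # 0.0: appears everywhere, so idf = log(1) = 0
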
28
Indexing the web
  • An inverted index is created
  • Forward index sorted according to word
  • Maps keywords to URLs (see the sketch after this slide)
  • Some wrinkles
  • Morphology: stripping suffixes (stemming),
    singular vs. plural, tense, case folding
  • Semantic similarity
  • Words with similar meanings share an index.
  • Issue: trading coverage (number of hits) for
    precision (how closely hits match the request)

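Continuing the sketch above, a forward index can be inverted into a
keyword-to-URL map. The stop list, case folding, and example URLs are
simplified stand-ins for the full morphology handling described on the slide.

    from collections import defaultdict

    STOP_WORDS = {"the", "is", "a", "of"}     # tiny illustrative stop list

    # Forward index: URL -> words found on the page
    forward_index = {
        "http://example.org/a": ["Search", "engines", "crawl", "the", "Web"],
        "http://example.org/b": ["The", "Web", "is", "a", "directed", "graph"],
    }

    # Inverted index: word -> URLs containing it
    inverted_index = defaultdict(set)
    for url, words in forward_index.items():
        for word in words:
            folded = word.lower()             # case folding
            if folded not in STOP_WORDS:
                inverted_index[folded].add(url)

    print(sorted(inverted_index["web"]))      # both URLs
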
29
Indexing Issues
  • Indexing techniques were designed for static
    collections
  • How to deal with pages that change?
  • Periodic crawls, rebuild index.
  • Varied frequency crawls
  • Records need a way to be purged
  • Hash of page stored
  • Can use the text of a link to a page to help
    label that page.
  • Helps eliminate the addition of spurious keywords.

30
Indexing Issues
  • Availability and speed
  • Most search engines will cache the page being
    referenced.
  • Multiple search terms
  • OR: separate searches concatenated
  • AND: intersection of searches computed (see the
    sketch below).
  • Regular expressions not typically handled.
  • Parsing
  • Must be able to handle malformed HTML, partial
    documents

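With an inverted index like the one sketched earlier, multi-term queries
reduce to set operations over the postings; this is only meant to illustrate
the OR/AND point above.

    def search(inverted_index, terms, mode="AND"):
        postings = [inverted_index.get(t.lower(), set()) for t in terms]
        if not postings:
            return set()
        if mode == "AND":
            return set.intersection(*postings)   # pages matching every term
        return set.union(*postings)              # pages matching any term

    # search(inverted_index, ["web", "graph"], mode="AND")
    # search(inverted_index, ["web", "graph"], mode="OR")
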
31
PageRank
  • Google uses PageRank to determine relevance.
  • Based on the quality of a page's inward links.
  • Combine the PageRanks of the pages that point to
    a given page, each divided by its outdegree.
  • Let p be a page, with T1 ... Tn linking to p, and
    C(Ti) the outdegree of Ti.
  • PR(p) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  • d is a damping factor (see the iteration sketch below).
  • PR propagates through a graph.

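A compact sketch of that recurrence, iterated to a fixed point. The damping
factor 0.85 is the commonly cited value, the four-page graph is invented, and
dangling pages (no outward links) are ignored for brevity.

    def pagerank(graph, d=0.85, iterations=50):
        """graph: page -> set of pages it links to (every page has >= 1 link)."""
        pages = list(graph)
        pr = {p: 1.0 for p in pages}              # initial scores
        for _ in range(iterations):
            new_pr = {}
            for p in pages:
                incoming = sum(pr[t] / len(graph[t])
                               for t in pages if p in graph[t])
                new_pr[p] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    toy_web = {
        "A": {"B", "C"},
        "B": {"C"},
        "C": {"A"},
        "D": {"C"},
    }
    print(pagerank(toy_web))   # C, with three inward links, collects the highest rank
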
32
PageRank
  • Justification
  • Imagine a random surfer who keeps clicking
    through links.
  • d is the probability she starts a new search.
  • Or
  • A page has a high ranking if highly ranked pages
    point to it.
  • Pros: difficult to game the system
  • Cons: creates a rich-get-richer web structure
    where highly popular sites grow in popularity.

33
HITS
  • HITS is also commonly used for document ranking.
  • Gives each page a hub score and an authority
    score (the update rules are sketched after this slide)
  • A good authority is pointed to by many good hubs.
  • A good hub points to many good authorities.
  • Users want good authorities.

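A short sketch of the two HITS update rules on the same kind of toy graph:
authority scores come from the hub scores of pages linking in, hub scores
from the authority scores of pages linked to, with normalization each round.

    def hits(graph, iterations=50):
        """graph: page -> set of pages it links to."""
        pages = list(graph)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A good authority is pointed to by many good hubs.
            auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
            # A good hub points to many good authorities.
            hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
            # Normalize so the scores stay bounded.
            a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return hub, auth

    # hub, auth = hits(toy_web)   # reusing the toy graph from the PageRank sketch
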
34
Issues with Ranking Algorithms
  • Spurious keywords and META tags
  • Users reinforcing each other
  • Increases authority measure
  • Topic drift
  • Many hubs link to more than one topic

35
Web structure
  • Structure is important for
  • Predicting traffic patterns
  • Who will visit a site?
  • Where will visitors arrive from?
  • How many visitors can you expect?
  • Estimating coverage
  • Is a site likely to be indexed?

36
Core
  • Compact
  • Short paths between sites
  • Small world phenomenon
  • Distances between sites are small on average (short
    average path length)
  • Number of inward and outward links follows a
    power law.
  • Mechanism: preferential attachment (see the sketch below)
  • As new sites arrive, the probability of gaining
    an inward link is proportional to in-degree.

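A quick simulation of that mechanism: each new site links to an existing site
chosen with probability proportional to its current in-degree plus one (so
sites with no inward links can still be chosen). The in-degree distribution
that emerges is heavy-tailed, in line with the power law described above; the
parameters are arbitrary.

    import random

    def preferential_attachment(n_sites=10_000, seed=1):
        random.seed(seed)
        indegree = [0, 0]      # start with two sites
        tickets = [0, 1]       # one ticket per site, plus one per inward link received
        for new_site in range(2, n_sites):
            target = random.choice(tickets)        # chance proportional to indegree + 1
            indegree[target] += 1
            indegree.append(0)
            tickets.extend([target, new_site])
        return indegree

    degrees = preferential_attachment()
    print(max(degrees))                        # a handful of heavily linked hubs
    print(sorted(degrees)[len(degrees) // 2])  # while the median site has almost none
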
37
Power laws and small worlds
  • Power laws occur everywhere in nature
  • Distribution of site sizes, city sizes, incomes,
    word frequencies
  • Random networks tend to evolve according to a
    power law.
  • Small-world phenomenon
  • Neighborhoods will be joined by a common member
  • Hubs serve to connect neighborhoods
  • Linkage is closer than one might expect
  • Six Degrees of Separation, Kevin Bacon

38
Local structure
  • More diverse than a power law
  • Pages with similar topics self-organize into
    communities
  • Short average path length
  • High link density
  • Webrings
  • Inverse: does a high link density imply the
    existence of a community?
  • Can this be used to study the emergence and
    growth of web communities?

39
Hubs and Authorities
  • Common community structure
  • Hubs
  • Many outward links
  • Lists of resources
  • Authorities
  • Many inward links
  • Provide resources, content

40
Hubs and Authorities
(Figure: hubs on one side pointing to authorities on the other.)
Link-structure estimates suggest over 100,000 Web
communities, often not categorized by portals.
41
Web Communities
  • Alternate definition
  • Each member has more links to community members
    than non-community members.
  • Extension of a clique.
  • Can be discovered with network flow algorithms
    (see the sketch below).

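A rough sketch of the network-flow idea, assuming the networkx library:
attach a source to a few known seed pages, a sink to every other page, and
take the source side of a minimum s-t cut as the community. This follows the
general max-flow/min-cut formulation rather than any particular published
algorithm; the capacities and example pages are arbitrary.

    import networkx as nx

    def flow_community(links, seeds, edge_capacity=2):
        """links: iterable of (page, page) hyperlinks; seeds: known community members."""
        flow = nx.DiGraph()
        for u, v in links:                            # treat links as bidirectional
            flow.add_edge(u, v, capacity=edge_capacity)
            flow.add_edge(v, u, capacity=edge_capacity)
        source, sink = "_SOURCE_", "_SINK_"
        for s in seeds:
            flow.add_edge(source, s, capacity=10**9)  # effectively unlimited
        for node in list(flow.nodes()):
            if node not in seeds and node not in (source, sink):
                flow.add_edge(node, sink, capacity=1)
        _, (community, _) = nx.minimum_cut(flow, source, sink)
        return community - {source}

    # flow_community([("a", "b"), ("b", "c"), ("c", "a"), ("c", "x")], seeds={"a", "b"})
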
42
Weaknesses of search engines