Title: Search and Discovery: Searching the Web
1Search and Discovery: Searching the Web
2Stages of a transaction
- Discovery
- Find what you're interested in
- Locate sellers
- Locate buyers
- Compare products
- Negotiation
- Exchange
3Discovery
- Encompasses
- Search engines
- Recommender systems
- Price comparison/shopping agents
- Description languages
- Data sources
- Generic sources: portals, web directories
- Domain-specific sources: catalogs, guides, etc.
- Advertising
4Discovery
- More than just finding a resource
- Need to be able to estimate value, likelihood of successful negotiation
- An evaluative infrastructure is required
- Least formalized of e-commerce subareas.
- Unlikely to have a general-purpose solution soon
- Too complex
5A Brief History of the Web
- Prehistory
- Hypertext as an idea has been around since the 40s.
- Vannevar Bush: Memex
- Engelbart: 60s
- 1987 HyperCard
- Graphical tool allowing users to create hyperlinked documents.
- Late 80s/early 90s WAIS, Gopher
6A Brief History of the Web
- 1989/90 Tim Berners-Lee proposes the WWW at CERN
- A new global information retrieval system
- Develops HTML, a simple markup language
- 1993 Mosaic developed at NCSA
- Marc Andreessen then founds Netscape
- 1993/94 NCSA httpd released
- Open-source web server, supported CGI
- Precursor to Apache
7A Brief History of the Web
- 1994 Banner ads appear on HotWired
- Beginning of the commercial web
- 1994 Yahoo founded
- Appearance of the portal, search engine
- 1995 NSF backbone privatized
- AT&T, Sprint, etc. take over traffic
- Network Solutions given a monopoly on domain names
- 1995 Microsoft releases Internet Explorer
- In 7 years, Netscape goes from 100% market share to 20% (2001).
8A Brief History of the Web
- 1995 AltaVista started
- Full-text Web search
- 1995 Andreessen first WWW billionaire
- 1995 Sun introduces Java
- Able to ship code and text across networks
- 1995 eBay founded
- First online auction
- 1995-98 Explosive growth
- Many new formats, applications, companies
- 1998 Akamai founded (web caching)
9A Brief History of the Web
- 1998 ICANN governs names and addresses
- 1998 MP3 format popularized
- WinAmp released
- Small enough to make audio distribution practical
- 1998 Google founded.
- 2000 Napster appears
- Beginnings of peer-to-peer technology, file sharing
- 2000(ish) End of the boom
- Consolidation, reduction in growth
10Lessons from Radio
- Radio was popularized in the 1920s
- Originally intended as a one-to-one messaging system.
- Fee-for-use pay structure.
- 1922 Explosive growth begins
- RCA's revenues from sales of receivers doubled each year
- Broadcast model becomes prevalent
- Thousands of broadcasters emerge
11Lessons From Radio
- 1922-1924 Transition
- How to make money broadcasting?
- Support sale of receivers
- Goodwill (sponsors)
- Public good: supported as a non-profit
- Advertising
- Tube tax/set tax (a la BBC)
- By 1924, stations are failing as quickly as they
start.
12Lessons From Radio
- Affordable content driven by audience size
- Rich-get-richer for large stations
- 1926 RCA launches NBC
- First nationwide broadcast
- Creates the network system
- National content, local broadcasting
- Advertising the dominant revenue generator
- WWW questions
- Who will be NBC?
- What will the revenue model be?
- Advertising? Competition with TV, radio for this revenue.
- Micropayments? Subscriptions? Content aggregation?
13Searching the Web
- Web growth estimated at 1000% in late 90s.
- Can search engines keep up with this growth?
- How to deal with the dynamic nature of the web?
- Page contents change
- Pages appear, disappear, move
- Link structure changes
14Search Engines
- Most common form of discovery
- Crawl the web to collect pages
- Stored and indexed for easy retrieval
- Query languages simple
- Goals
- Fast retrieval (Google gets 150 million queries per day)
- Accurate (no dead links)
- Precise (pages match users' needs)
15Terminology
- Outward link
- Object that a page links to
- Outdegree: number of outward links
- Inward link
- Pages that link to an object
- Indegree: number of inward links
- Path
- Series of outward links from A to B
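The terms above can be made concrete on a small link graph. This is a minimal sketch: the `links` dictionary and the helper names are made up for illustration, and the path check uses a plain breadth-first search.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def outdegree(graph, page):
    """Number of outward links on `page`."""
    return len(graph.get(page, []))

def indegree(graph, page):
    """Number of pages whose outward links include `page`."""
    return sum(page in targets for targets in graph.values())

def has_path(graph, start, goal):
    """Breadth-first search: can `goal` be reached from `start` via outward links?"""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            return True
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

In this toy graph D can reach B (D, C, A, B), but nothing links to D, so no path leads back to it.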
16The Web as a Directed Graph
- We can represent the web as a directed graph.
- Sites are nodes
- Links are edges.
- Outward link
- Object that a page links to
- Inward link
- Pages that link to an object
17The Web as a Directed Graph
18Adjacency Matrix
- We can also represent the Web as a very large adjacency matrix.
- The eigenvector of this matrix illustrates the clusteredness of the Web
- Distribution of in-degree and out-degree
- Connectedness
- Some ranking algorithms (HITS) use this measure.
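As a rough illustration of the matrix view, the sketch below builds the adjacency matrix for a hypothetical four-page graph using plain Python lists. Row sums give out-degree, column sums give in-degree, and squaring the matrix counts two-link paths. The page names and edges are assumptions chosen for the example.

```python
pages = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

index = {page: i for i, page in enumerate(pages)}
n = len(pages)

# adj[i][j] == 1 when page i links to page j.
adj = [[0] * n for _ in range(n)]
for src, dst in edges:
    adj[index[src]][index[dst]] = 1

out_degree = [sum(row) for row in adj]        # row sums
in_degree = [sum(col) for col in zip(*adj)]   # column sums

def matmul(x, y):
    """Plain-Python matrix product of two square matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

# two_step[i][j] is the number of two-link paths from page i to page j.
two_step = matmul(adj, adj)
```

A real Web-scale matrix is far too sparse and large for nested lists; a sparse representation would be needed in practice.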
19Web structure
- Web can be broken into four areas (Kleinberg/Lawrence)
- Core: path between any two pages
- Upstream: can reach the core, but no path from core.
- Downstream: can be reached from core, but cannot reach core.
- Tendrils/islands: disconnected from the core.
- Areas (allegedly) have roughly equal size.
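The four areas can be recovered from reachability alone. The sketch below assumes one page is already known to be in the core and classifies every other page by whether it can reach that page, be reached from it, both, or neither. The function names and the toy `web` graph are hypothetical.

```python
def reachable(graph, start):
    """All pages reachable from `start` by following outward links."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(graph, core_page):
    """Label each page relative to the core that contains `core_page`."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    reverse = {p: [] for p in pages}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)
    forward = reachable(graph, core_page)      # pages the core can reach
    backward = reachable(reverse, core_page)   # pages that can reach the core
    regions = {}
    for p in pages:
        if p in forward and p in backward:
            regions[p] = "core"
        elif p in backward:
            regions[p] = "upstream"
        elif p in forward:
            regions[p] = "downstream"
        else:
            regions[p] = "tendril/island"
    return regions

web = {"U": ["A"], "A": ["B"], "B": ["A", "D"], "D": [], "X": ["Y"], "Y": []}
regions = bow_tie(web, "A")
```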
20Coverage
- Search engines claim they index a large fraction of the web.
- How to verify this?
- Run queries on many engines and compare number of hits.
- May return irrelevant documents
- Documents may no longer exist
- Documents may have changed
21Coverage
- NEC (1998): Estimate size of web, coverage for major search engines.
- Query each engine, retrieve and compare all results (only exact matches).
- Coverage estimates
- HotBot 57%, AltaVista 46%
- NorthernLight 33%, Excite 23%
- Infoseek 16%, Lycos 4%
22Estimating the size of the indexable web
- Overlap in coverage was used to estimate size.
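[Figure: Venn diagram of the pages indexed by engine A, the pages indexed by engine B, and their overlap U]
- U/B serves as an estimate of A/N, where N is the size of the Web.
- 1998 AltaVista/HotBot estimate: 320 million pages.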
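Rearranging U/B = A/N gives N = A * B / U. A minimal sketch of that arithmetic follows, assuming U is the number of pages indexed by both engines; the counts are made up and are not the figures behind the 1998 estimate.

```python
def estimate_web_size(size_a, size_b, overlap):
    """Estimate N from U / B ~= A / N, i.e. N ~= A * B / U."""
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return size_a * size_b / overlap

# Made-up counts: engine A indexes 100 pages, engine B 80, and 40 are shared.
n_est = estimate_web_size(size_a=100, size_b=80, overlap=40)
```

The estimate assumes the two engines sample the Web independently; if both favor the same popular pages, the overlap is inflated and N is underestimated.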
23Using size to refine coverage estimates (1997)
- This value can then be used to determine a coverage estimate for each engine.
- For each pair, solve for N.
- Assume real N is largest found.
- Updated: HotBot 34%, AltaVista 28%
- NorthernLight 20%, Excite 14%
- Infoseek 10%, Lycos 3%
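The pairwise procedure can be sketched as follows: solve for N for every pair of engines, keep the largest value, and divide each engine's index size by it. The engine names, index sizes, and overlaps are hypothetical, not the study's data.

```python
from itertools import combinations

# Hypothetical index sizes and pairwise overlaps (in millions of pages).
sizes = {"E1": 100, "E2": 80, "E3": 50}
overlaps = {("E1", "E2"): 40, ("E1", "E3"): 20, ("E2", "E3"): 25}

# For each pair, N ~= A * B / U.
estimates = {
    (a, b): sizes[a] * sizes[b] / overlaps[(a, b)]
    for a, b in combinations(sizes, 2)
}

n = max(estimates.values())   # assume the real N is the largest found
coverage = {engine: size / n for engine, size in sizes.items()}
```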
24Updates (1999)
- Web growth ahead of indexing
- No search engine covers more than 16% of the Web.
- Union of all engines: 50% coverage
- Estimated size: 800 million pages
- Search engines more likely to link to authorities
- More likely to link to US, commercial sites.
25Updates (12/2001)
- Self-reported number of pages indexed
- Google: 2 billion (3 billion today)
- FAST (AllTheWeb.com): 625 million
- (claimed 2.1 billion in 2002)
- AltaVista: 550 million
- Inktomi: 500 million
- NorthernLight: 390 million
26Indexing the web
- Spiders are used to crawl the web and collect pages.
- A page is downloaded and its outward links are found.
- Each outward link is then downloaded.
- Exceptions
- Links from CGI interfaces
- Robot Exclusion Standard
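The crawl loop above can be sketched with the standard library's `html.parser`. To keep the example runnable offline, pages come from an in-memory dictionary rather than the network. The `fetch` hook, the `fake_web` data, and the `excluded_prefixes` argument (a simple stand-in for a real Robot Exclusion Standard check) are all assumptions.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(fetch, seed, excluded_prefixes=()):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None."""
    pages, queue, seen = {}, deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        if url.startswith(tuple(excluded_prefixes)):
            continue                  # stand-in for a robots.txt check
        html = fetch(url)
        if html is None:
            continue                  # dead link
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Tiny in-memory "web" so the sketch runs without network access.
fake_web = {
    "/index": '<a href="/a">a</a> <a href="/private/x">x</a>',
    "/a": '<a href="/index">home</a> <a href="/missing">gone</a>',
    "/private/x": "secret",
}
crawled = crawl(fake_web.get, "/index", excluded_prefixes=["/private/"])
```

The `seen` set keeps the spider from downloading a page twice when several pages link to it.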
27Indexing the Web
- Stop words stripped from page
- Forward index created
- Bundles words
- Maps each document to the words it contains.
- Can use TF-IDF to map only significant keywords
- Term Frequency × Inverse Document Frequency
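A minimal sketch of these steps follows: strip stop words, build a forward index, and score words with TF-IDF. It uses one common variant (relative term frequency times log(N/df)); the slide does not say which weighting is meant, and the stop-word list and documents are made up.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to"}   # tiny illustrative list

docs = {
    "d1": "the price of the red car",
    "d2": "the red book",
    "d3": "a guide to the car market",
}

# Forward index: document -> counts of its non-stop words.
forward = {
    doc_id: Counter(w for w in text.lower().split() if w not in STOP_WORDS)
    for doc_id, text in docs.items()
}

def tf_idf(word, doc_id):
    """Relative term frequency times log(N / document frequency)."""
    tf = forward[doc_id][word] / sum(forward[doc_id].values())
    df = sum(1 for counts in forward.values() if word in counts)
    return tf * math.log(len(forward) / df) if df else 0.0
```

In `d1`, "price" scores higher than "red" because "red" also appears in `d2`, which makes it less distinctive.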
28Indexing the web
- An inverted index is created
- Forward index sorted according to word
- Maps keywords to URLs
- Some wrinkles
- Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding
- Semantic similarity
- Words with similar meanings share an index.
- Issue: trading coverage (number of hits) for precision (how closely hits match request)
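Inverting a forward index can be sketched in a few lines. The `normalize` helper below does case folding plus a deliberately naive plural-stripping rule, standing in for a real stemmer; the URLs and words are hypothetical.

```python
from collections import defaultdict

# Hypothetical forward index: URL -> words found on that page.
forward = {
    "http://a.example/": ["Cars", "prices"],
    "http://b.example/": ["car", "guide"],
}

def normalize(word):
    """Case folding plus a naive plural-stripping rule (not a real stemmer)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Inverted index: normalized keyword -> set of URLs containing it.
inverted = defaultdict(set)
for url, words in forward.items():
    for word in words:
        inverted[normalize(word)].add(url)
```

Folding "Cars" and "car" into one entry returns both pages for either query: more hits, at some cost in how closely each hit matches the request.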
29Indexing Issues
- Indexing techniques were designed for static collections
- How to deal with pages that change?
- Periodic crawls, rebuild index.
- Varied frequency crawls
- Records need a way to be purged
- Hash of page stored
- Can use the text of a link to a page to help label that page.
- Helps eliminate the addition of spurious keywords.
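One plausible use of the stored hash is change detection between crawls. That reading is an assumption, since the slide does not spell it out. The sketch below uses SHA-256 from `hashlib`; the `stored` table and `needs_reindex` helper are hypothetical.

```python
import hashlib

def page_hash(content):
    """Hex digest of the page text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

stored = {}   # URL -> hash recorded at the last crawl

def needs_reindex(url, content):
    """True when the page is new or its content differs from the stored hash."""
    digest = page_hash(content)
    changed = stored.get(url) != digest
    stored[url] = digest
    return changed
```

A crawler could call `needs_reindex` on every fetch and rebuild index entries only for pages that return True.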
30Indexing Issues
- Availability and speed
- Most search engines will cache the page being referenced.
- Multiple search terms
- OR: separate searches concatenated
- AND: intersection of searches computed.
- Regular expressions not typically handled.
- Parsing
- Must be able to handle malformed HTML, partial documents
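With an inverted index in hand, multi-term queries reduce to set operations. A minimal sketch over a hypothetical index follows; the keywords and URL identifiers are made up.

```python
# Hypothetical inverted index: keyword -> set of URL identifiers.
inverted = {
    "car": {"u1", "u2", "u3"},
    "red": {"u2", "u3", "u4"},
    "guide": {"u3"},
}

def search_or(terms):
    """Union of the per-term result sets."""
    return set().union(*(inverted.get(t, set()) for t in terms))

def search_and(terms):
    """Intersection of the per-term result sets."""
    postings = [inverted.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

An AND query containing a term that is not in the index returns nothing, because that term's result set is empty.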
31PageRank
- Google uses PageRank to determine relevance.
- Based on the quality of a page's inward links.
- Sum the PageRanks of the pages that point to a given page, each divided by its outdegree.
- Let p be a page, with $T_1, \dots, T_n$ linking to p.
- $PR(p) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{out(T_i)}$, where $out(T_i)$ is the outdegree of $T_i$
- d is a damping factor.
- PR propagates through a graph.
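The recurrence on this slide can be iterated until the scores settle. A minimal sketch, assuming the form PR(p) = (1 - d) + d * sum(PR(T_i) / out(T_i)), a damping factor of 0.85 (a commonly quoted value, not stated on the slide), and a hypothetical three-page graph. Pages with no outward links simply pass on no rank here.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(T) / out(T)) over pages T linking to p."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1.0 - d for p in pages}
        for src, targets in links.items():
            for dst in targets:
                # Each page splits its rank evenly across its outward links.
                new[dst] += d * pr[src] / len(targets)
        pr = new
    return pr

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this graph C ends up ranked highest: it collects rank from both A and B, while B receives only half of A's.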
32PageRank
- Justification
- Imagine a random surfer who keeps clicking through links.
- 1 - d is the probability she starts a new search.
- Or
- A page has a high ranking if highly ranked pages point to it.
- Pros: difficult to game the system
- Cons: creates a rich-get-richer web structure where highly popular sites grow in popularity.
33HITS
- HITS is also commonly used for document ranking.
- Gives each page a hub score and an authority score
- A good authority is pointed to by many good hubs.
- A good hub points to many good authorities.
- Users want good authorities.
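The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. This is an illustrative version that normalizes scores by their Euclidean norm after each round; the graph and page names are hypothetical.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

Here `a1` is the stronger authority because both hubs point to it, and `h1` is the stronger hub because it points to both authorities.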
34Issues with Ranking Algorithms
- Spurious keywords and META tags
- Users reinforcing each other
- Increases authority measure
- Topic drift
- Many hubs link to more than one topic
35Web structure
- Structure is important for
- Predicting traffic patterns
- Who will visit a site?
- Where will visitors arrive from?
- How many visitors can you expect?
- Estimating coverage
- Is a site likely to be indexed?
36Core
- Compact
- Short paths between sites
- Small world phenomenon
- Distances are small relative to average path length
- Number of inward and outward links follows a power law.
- Mechanism: preferential attachment
- As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
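Preferential attachment is easy to simulate. In the sketch below each new page links to one existing page, chosen with weight in-degree + 1. The "+ 1" is an assumption added so that pages with no inward links can still be picked, and the single-link-per-page rule is a simplification.

```python
import random

def grow(n_new, seed=0):
    """Add pages one at a time; each links to one existing page."""
    rng = random.Random(seed)
    indegree = [0]                    # start with a single page
    for _ in range(n_new):
        # Weight by in-degree + 1 (assumption: strict proportionality
        # would never select a page that has no inward links yet).
        weights = [k + 1 for k in indegree]
        target = rng.choices(range(len(indegree)), weights=weights)[0]
        indegree[target] += 1
        indegree.append(0)            # the new page starts with no inward links
    return indegree

degrees = grow(1000)
```

Sorting `degrees` and comparing the largest values with the median is a quick way to see how unevenly inward links end up distributed.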
37Power laws and small worlds
- Power laws occur everywhere in nature
- Distribution of site sizes, city sizes, incomes, word frequencies
- Random networks tend to evolve according to a power law.
- Small-world phenomenon
- Neighborhoods will be joined by a common member
- Hubs serve to connect neighborhoods
- Linkage is closer than one might expect
- Six Degrees of Separation, Kevin Bacon
38Local structure
- More diverse than a power law
- Pages with similar topics self-organize into communities
- Short average path length
- High link density
- Webrings
- Inverse: does a high link density imply the existence of a community?
- Can this be used to study the emergence and growth of web communities?
39Hubs and Authorities
- Common community structure
- Hubs
- Many outward links
- Lists of resources
- Authorities
- Many inward links
- Provide resources, content
40Hubs and Authorities
[Figure: hubs linking to authorities]
- Link structure estimates over 100,000 Web communities
- Often not categorized by portals
41Web Communities
- Alternate definition
- Each member has more links to community members than non-community members.
- Extension of a clique.
- Can be discovered with network flow algorithms.
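The definition above can be checked directly for a candidate set of pages. The sketch below only verifies the condition; it does not discover communities, which is where the network flow algorithms come in. It treats links as undirected, which is an assumption, and the `web` graph is made up.

```python
def is_community(links, members):
    """Check that every member has more links to members than to non-members."""
    members = set(members)
    # Treat links as undirected for this check (an assumption).
    neighbours = {}
    for src, targets in links.items():
        for dst in targets:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    for m in members:
        linked = neighbours.get(m, set())
        inside = len(linked & members)
        outside = len(linked - members)
        if inside <= outside:
            return False
    return True

web = {"a": ["b", "c"], "b": ["c"], "c": ["x"], "x": ["y"], "y": []}
```

In this graph `{"a", "b", "c"}` passes the check even though `c` also links outside the group, while `{"c", "x"}` fails because `c` has more links outside that set than inside it.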
42Weaknesses of search engines