Title: WEB SEARCH and P2P
1 WEB SEARCH and P2P
- Advisor: Dr. Sushil Prasad
- Presented by: D.M. Rasanjalee Himali
2 OUTLINE
- Introduction to web search engines
- What is a web search engine?
- Web search engine architecture
- How a web search engine works
- Relevance and ranking
- Limitations in current web search engines
- P2P web search engines
- YouSearch
- Coopeer
- ODISSEA
- Conclusion
3 What is a web search engine?
- A web search engine is a search engine designed to search for information on the World Wide Web.
- The information may consist of web pages, images, and other types of files.
- Some search engines also mine data available in newsgroups, databases, or open directories.
4 History

Company | Searches (millions) | Market share (%)
Google | 28,454 | 46.47
Yahoo! | 10,505 | 17.16
Baidu | 8,428 | 13.76
Microsoft | 7,880 | 12.87
NHN | 2,882 | 4.71
eBay | 2,428 | 3.9
Time Warner | 1,062 | 1.6
Ask.com | 728 | 1.1
Yandex | 566 | 0.9
Alibaba.com | 531 | 0.8
Total | 61,221 | 100.0
- Before there were search engines, there was a complete list of all web servers.
- The very first tool used for searching on the Internet was Archie:
  - downloaded directory listings of files on FTP sites
  - did not index the contents of these sites
- Soon after, many search engines appeared: Excite, Infoseek, Northern Light, AltaVista, Yahoo!, Google, MSN Search.
5 How a Web Search Engine Works
- A search engine operates in the following order:
  - Web crawling
  - Indexing
  - Searching
6 Web Crawling
- A web crawler is a program which browses the World Wide Web in a methodical, automated manner:
  - a means of providing up-to-date data
  - creates a copy of all visited pages for later processing by a search engine
- The crawler starts with a list of URLs to visit, called the seeds.
- As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
- URLs from the frontier are recursively visited according to a set of policies.
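The seed/frontier loop above can be sketched as a simple breadth-first traversal. This is an illustrative sketch, not any engine's actual crawler; the `get_links` callback stands in for fetching a page over the network and extracting its hyperlinks, and the toy `web` dict is an assumed stand-in for the real link graph:

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """BFS over the crawl frontier starting from seed URLs.

    get_links(url) stands in for fetching a page and extracting
    its hyperlinks (a network fetch in a real crawler).
    """
    frontier = deque(seeds)          # URLs still to visit (the crawl frontier)
    visited = set()
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):  # add discovered links to the frontier
            if link not in visited:
                frontier.append(link)
    return order

# Toy link graph standing in for the web
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}
pages = crawl(["a.com"], lambda u: web.get(u, []))
print(pages)  # ['a.com', 'b.com', 'c.com']
```

A real crawler would additionally honor the politeness policies (robots.txt, crawl delays) discussed on the next slide.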
7 Robot Exclusion Protocol
- Also known as the robots.txt protocol.
- A convention to prevent cooperating web robots from accessing all or part of a website that is otherwise publicly viewable.
- Example directives:
  - User-agent: *
  - Disallow: /cgi-bin/
  - Disallow: /images/
  - Disallow: /tmp/
  - Disallow: /private/
  - Sitemap: http://www.example.com/sitemap.xml.gz
  - Crawl-delay: 10
  - Allow: /folder1/myfile.html
  - Request-rate: 1/5 (maximum rate is one page every 5 seconds)
  - Visit-time: 0600-0845 (only visit between 06:00 and 08:45 UTC)
- It relies on the cooperation of the web robot: marking an area of a site out of bounds with robots.txt does not guarantee privacy.
- The standard complements Sitemaps, a robot inclusion standard for websites.
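A compliant crawler checks these rules before fetching a page. Python's standard library ships a parser for the protocol; the sketch below feeds it rules directly instead of fetching a live robots.txt (the bot name "MyBot" is a made-up example):

```python
from urllib.robotparser import RobotFileParser

# Rules like those on the slide, as the lines of a robots.txt file
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Allow: /folder1/myfile.html
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # normally rp.set_url(...) + rp.read() would fetch it

print(rp.can_fetch("MyBot", "http://www.example.com/index.html"))      # True
print(rp.can_fetch("MyBot", "http://www.example.com/private/x.html"))  # False
```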
8 Sitemap Protocol
- Allows a webmaster to inform search engines about URLs on a website that are available for crawling.
- A Sitemap is an XML file that lists the URLs for a site.
- It allows webmasters to include additional information about each URL:
  - when it was last updated,
  - how often it changes, and
  - how important it is in relation to other URLs on the site.
- This allows search engines to crawl the site more intelligently.
- Sitemaps are a URL inclusion protocol and complement robots.txt.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
9 Distributed Web Crawling
- Internet search engines employ many computers to index the Internet via web crawling.
- Dynamic assignment:
  - a central server assigns new URLs to different crawlers dynamically.
  - allows the central server to dynamically balance the load of each crawler.
- Static assignment:
  - a fixed rule, stated from the beginning of the crawl, defines how to assign new URLs to the crawlers.
- Google uses thousands of individual computers in multiple locations to crawl the Web.
10 Indexing
- The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
- Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
- The contents of each page are analyzed to determine how it should be indexed.
  - e.g., words are extracted from the titles, headings, or special fields called meta tags.
- Meta search engines reuse the indices of other services and do not store a local index.
11 Challenges in Parallelism
- A major challenge in the design of search engines is the management of parallel computing processes.
- There are many opportunities for race conditions and coherence faults.
  - e.g., a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries.
- The search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison.
- This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.
12 Inverted Indices
- An inverted index stores a list of the documents containing each word.
- The search engine can use direct access to find the documents associated with each word in the query and retrieve the matching documents quickly.

Word | Documents
the | Document 1, Document 3, Document 4, Document 5
cow | Document 2, Document 3, Document 4
says | Document 5
moo | Document 7
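The structure above can be built and queried in a few lines. A minimal sketch (the toy documents are made up; real engines store positions and other posting data as well):

```python
from collections import defaultdict

# Toy corpus: document ID -> text
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}

# Build the inverted index: word -> set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    """Return the documents containing every term of the query."""
    postings = [index.get(t, set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("quick the"))  # {1, 3}
```

Intersecting the per-word posting sets is what lets the engine answer a multi-term query without scanning every document.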
13 Searching
- A web search query is a query that a user enters into a web search engine to satisfy his or her information needs.
- It is distinctive in that it is unstructured and often ambiguous.
- Web queries vary greatly from standard query languages, which are governed by strict syntax rules.
14 Searching
- Three broad categories cover most web search queries:
- Informational queries
  - Queries that cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results.
- Navigational queries
  - Queries that seek a single website or web page of a single entity (e.g., youtube or delta airlines).
- Transactional queries
  - Queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.
15 Web Search Engine Architecture
(Architecture diagram: the crawler fetches pages from the URL list and stores them compressed in the repository; the indexer reads and uncompresses the repository, parses each document into hit lists, distributes the hits into barrels by docID, and parses out links into an anchors file; anchor text and relative URLs are resolved into absolute URLs and docIDs; the sorter resorts the barrels by wordID to produce the lexicon and inverted index; PageRank is calculated for all documents and combined with the inverted index to answer queries.)
From "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Sergey Brin and Lawrence Page.
16 Important Properties of Commercial Web Search
- To be successful, a commercial search engine must address all of these issues/properties:
  - Millions of heterogeneous users
  - Goal is to make money
  - UI is extremely important
  - Real-time/fast expectation
  - Content of a web page is not sufficient to imply meaning
  - Result ranking cannot assume independence
  - Must consider maliciousness
  - No quality control on pages (quality varies)
  - Web is large (practically infinite)
17 Relevance and Ranking
- Exactly how a particular search engine's algorithm works is a closely kept trade secret.
- However, all major search engines follow the general rules below: location, location, location... and frequency.
- Location
  - Search engines check whether the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
- Frequency
  - A search engine analyzes how often keywords appear in relation to other words in a web page. Pages with a higher frequency are often deemed more relevant than other web pages.
18 Precision and Recall
- Two widely used measures for evaluating the quality of results in Information Retrieval.
- Precision
  - The fraction of the documents retrieved that are relevant to the user's information need.
  - Precision = (number of relevant documents retrieved by a search) / (total number of documents retrieved by that search)
- Recall
  - The fraction of the documents relevant to the query that are successfully retrieved.
  - Recall = (number of relevant documents retrieved by a search) / (total number of existing relevant documents that should have been retrieved)
- Often, there is an inverse relationship between precision and recall.
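The two ratios are easy to compute once the retrieved set and the ground-truth relevant set are known. A minimal sketch with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given two sets of doc IDs."""
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}  # what the engine returned
relevant = {"d1", "d3", "d5"}         # ground-truth relevant set
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 (2 of 4 retrieved are relevant), 0.666... (2 of 3 relevant retrieved)
```

The inverse relationship shows up here directly: returning more documents can only grow the relevant-hits count (raising recall) while usually diluting the retrieved set (lowering precision).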
19 Relevance and Ranking
- Webmasters constantly rewrite their web pages in an attempt to gain better rankings.
- Some sophisticated webmasters may even "reverse engineer" the location/frequency systems used by a particular search engine.
- Because of this, all major search engines now also make use of "off the page" ranking criteria.
20 Relevance and Ranking
- Off-the-page factors are those that webmasters cannot easily influence.
- Link analysis
  - The search engine analyzes how pages link to each other.
  - Helps determine what a page is about and whether that page is "important", and thus deserving of a ranking boost.
- Click-through measurement
  - A search engine watches which results someone selects for a particular search,
  - eventually drops high-ranking pages that aren't attracting clicks, and
  - promotes lower-ranking pages that do pull in visitors.
21 Limitations in Current Web Search Engines
- Centralized search engines have limited scalability.
- Crawler-based indices are stale and incomplete.
- Fundamental issue: how much of the web is crawlable?
  - If you follow the rules, many sites say "robots get lost".
  - What about dynamic content? (the Deep Web)
    - The deep web is around 500 times larger than the surface web. These deep web resources mainly consist of data held by databases that can be accessed only through queries. Since crawlers discover resources only through links, they cannot discover these resources.
- There's no guarantee that current search engines index, or even crawl, the total surface web.
22 Limitations in Current Web Search Engines
- Single point of failure.
- Ambiguous words:
  - Polysemy: words with multiple meanings ("train car" vs. "train a neural network").
  - Synonymy: multiple words with the same meaning ("the neural network is trained as follows" vs. "the neural network learns as follows").
- What about phrases? Searches are not just bags of words.
  - Positional information? Structural information (case and punctuation are thrown out)?
- Non-text content: data worth storing.
- Most web search engines today crawl only the surface web.
23 P2P Web Search
- The last few years have seen an explosion of activity in the area of peer-to-peer (P2P) systems.
- Since an increasing amount of content now resides in P2P networks, it becomes necessary to provide search facilities within P2P networks.
- The significant computing resources provided by a P2P system could also be used to implement search and data mining functions for content located outside the system:
  - e.g., for search and mining tasks across large intranets or global enterprises, or even to build a P2P-based alternative to the current major search engines.
24 P2P Web Search
- The characteristics that distinguish P2P systems from previous technologies:
  - low maintenance overhead
  - improved scalability
  - improved reliability
  - synergistic performance
  - increased autonomy and privacy
  - dynamism
25 P2P Web Search Engines
- YouSearch
- Coopeer
- ODISSEA
26 YouSearch
- YouSearch is a distributed search application for personal webservers operating within a shared context.
- It allows peers to aggregate into groups, and users to search over specific groups.
- Goal: provide fast, fresh, and complete results to users.
27 YouSearch
- System overview: participants in YouSearch
  - Peer nodes: run YouSearch-enabled clients.
  - Browsers: search YouSearch-enabled content through their web browsers.
  - Registrar: a centralized, light-weight service that acts like a blackboard on which peer nodes store and look up (summarized) network state.
28 YouSearch
- System overview: the search system
  - Each peer node closely monitors its own content to maintain a fresh local index.
  - A bloom-filter content summary is created by each peer and pushed to the registrar.
  - When a browser issues a search query at a peer p, the peer p first queries the summaries at the registrar to obtain a set of peers R in the network that are hosting relevant documents.
  - The peers in R are then contacted directly with the query to obtain the URLs for its results.
  - To quickly satisfy subsequently issued queries with identical terms, the results from each query issued at a peer p are cached for a limited time at p.
29 YouSearch
- Indexing
  - Indexing is periodically executed at every peer node.
  - The Inspector examines each shared file for its last modification date and time.
  - If the file is new or has changed, it is passed to the Indexer.
  - The Indexer maintains a disk-based inverted index over the shared content.
  - The name and path information of the file are indexed as well.
30 YouSearch
- Indexing: the Summarizer
  - The Summarizer obtains a list of terms T from the Indexer and creates a bloom filter from them in the following way:
    - A bit vector V of length L is created with each bit set to 0.
    - A specified hash function H with range {1, ..., L} is used to hash each term t in T, and the bit at position H(t) in V is set to 1.
  - YouSearch uses k independent hash functions H1, H2, ..., Hk and constructs k different bloom filters, one for each hash function.
  - In YouSearch, the length of each bloom filter is L = 64 Kbits and the number of bloom filters k is set to 3.
  - The Summary Manager at the registrar aggregates these bloom filters into a structure that maps each bit position to the set of peers whose bloom filters have the corresponding bit set.
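The summarize-then-aggregate mechanism can be sketched as follows. This is an illustrative sketch, not YouSearch code: the peer names and shared terms are made up, L is shrunk from 64 Kbits to 64 bits for readability, and SHA-256 with a salt stands in for the paper's unspecified hash functions:

```python
import hashlib
from collections import defaultdict

L = 64  # bit-vector length (64 Kbits in YouSearch; tiny here for illustration)
K = 3   # number of independent hash functions / bloom filters

def h(term, i):
    """i-th hash function: a bit position in [0, L)."""
    return int(hashlib.sha256(f"{i}:{term}".encode()).hexdigest(), 16) % L

def summarize(terms):
    """Per-peer summary: K bloom filters, stored as sets of set-bit positions."""
    filters = [set() for _ in range(K)]
    for t in terms:
        for i in range(K):
            filters[i].add(h(t, i))
    return filters

# Registrar side: aggregate summaries into (filter index, bit) -> set of peers
peers = {"p1": ["p2p", "search"], "p2": ["cooking", "recipes"]}
registrar = defaultdict(set)
for peer, terms in peers.items():
    for i, f in enumerate(summarize(terms)):
        for bit in f:
            registrar[(i, bit)].add(peer)

def candidate_peers(query_terms):
    """Peers whose summaries have every bit of every query term set."""
    result = set(peers)
    for t in query_terms:
        for i in range(K):
            result &= registrar.get((i, h(t, i)), set())
    return result

print(candidate_peers(["search"]))  # contains 'p1'; false positives are possible
```

Only the candidate peers returned here are contacted with the actual query, which is what keeps the registrar light-weight.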
31 YouSearch
(Query-flow diagram: a browser submits a query to a peer; the peer computes the hashes of the query keywords and looks up the bit-position-to-IP-address mapping at the registrar; the registrar determines the intersection of the peer lists for the corresponding bits of each of the k bloom filters and returns the matching peers; the querying peer then contacts each peer in the list and obtains a list of URLs for matching documents.)
32 YouSearch
- Caching
  - Every time a global query is answered with non-zero results, the querying peer caches the result set of URLs U (temporarily).
  - The peer then informs the registrar of this fact.
  - The registrar adds a mapping from the query to the IP address of the caching peer in its cache table.
33 YouSearch
- Limitations
  - False positive results: 17.38%
  - Central registrar → single point of failure
  - No extensive phrase search
  - No attention has been given to query ranking
  - No human user collaboration
34 Coopeer
- Coopeer is a P2P web search engine where each user computer stores a part of the web model used for indexing and retrieving web resources in response to queries.
- Goal: complement centralized search engines to provide more humanized and personalized results by utilizing users' collaboration.
35 Coopeer
- (a) Collaboration
  - One may look for interesting web pages in the P2P knowledge repository consisting of shared web pages.
  - A novel collaborative filtering technique called PeerRank is presented to rank pages proportionally to the votes from relevant peers.
- (b) Humanization
  - Coopeer uses a query-based representation for documents.
  - The relevant words are not directly extracted from page content but introduced by human users with a high proficiency in their expertise domains.
- (c) Personalization
  - Similar users are self-organized according to the semantic content of their search sessions.
  - Thus, a requestor peer can extend routing paths along its neighbors, rather than just taking a blind shot.
  - User-customized results can be obtained along personal routing paths, in contrast with CSEs (centralized search engines).
36 Coopeer
- System overview
  - The requestor forwards the query based on semantic routing.
  - Peers maintain a local index about the semantic content of remote peers.
  - On receiving a query message from a remote peer, the current peer checks it against the local store.
  - To facilitate this, a novel query-based representation of documents is introduced.
  - Based on the query representation, the cosine similarity between the new query and documents can be computed.
  - Documents are relevant enough if the similarity exceeds a certain threshold.
  - These results are then returned to the requestor.
  - On receiving the returned results, the requestor peer ranks them in terms of the preference of its human owner using the PeerRank method.
37 Coopeer
- The Coopeer client consists of four main software agents:
- The User Agent
  - is responsible for interacting with the users.
  - It provides a friendly user interface, so that users can conveniently manage and manipulate whole search sessions.
- The Web-searcher Agent
  - is the resource of the P2P knowledge repository.
  - It performs the user's individual searching with several search engines from the Internet.
- The Collaborator Agent
  - is the key component for performing users' real-time collaborative searching.
  - It facilitates maintaining the P2P knowledge repository: information sharing, searching, and fusion.
- The Manager Agent
  - coordinates and manages the other types of agents.
  - It is also responsible for updating and maintaining data.
38 Coopeer
- PeerRank
  - All the users are taken as a Referrer Network.
  - Determines page relevance by examining a radiating network of referrers.
  - Documents with more referrers gain higher ranks.
  - Obtains a better rank order, as collaborative evaluation by human users is much more precise than descriptions based on term frequency or link counts.
  - Prevents spam, since it is difficult to fake evaluations from human users.
39 Coopeer
- PeerRank
  - For a given search session, we first compute the similarity between the requestor's favorite list and the referrers' lists; this similarity is then used as the baseline of the recommending degree of the referrer.
  - First, as shown in equation (1), the similarity of the local list and a recommended list is given by the Kendall measure.
  - Second, we convert the rank of a given URL in its recommended list to a moderate score.
- Notation:
  - R(e): weight of URL e
  - C(e): the set constituted by e's referrers
  - Z: a constant > 1
  - p: the local peer
  - Pi: a remote peer
  - Lp, Lpi: the lists of p and Pi, respectively
  - K(r)(Lp, Lpi): the Kendall function measuring the distance between the local list and the recommended list
  - r: decay factor
  - SLpi(e): the score of e in the recommended list
  - Re: the rank of e
  - RMax: the highest rank of list Pi, i.e., the length of the list
40 Coopeer
- Kendall measure
  - The Kendall measure is used to measure the distance between two lists of the same length.
  - The paper extends it to measure two lists of different lengths.
- Kendall function notation:
  - t1 and t2: two lists composed of URLs
  - Kr(t1, t2): the distance between t1 and t2
  - r: a fixed parameter with 0 < r < 1
  - The normalization constant is the possible maximum of the distance.
  - U(t1, t2): the set consisting of all the URLs in t1 and t2
  - Kr_i,j(t1, t2): the penalty of the URL pair (i, j)
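The classical base case that Coopeer extends can be sketched directly: the Kendall tau distance between two rankings of the same items is the fraction of item pairs ordered differently, normalized by the maximum possible number of discordant pairs. This sketch covers only the equal-length case, without the paper's decay factor r or its different-length extension:

```python
from itertools import combinations

def kendall_distance(t1, t2):
    """Normalized Kendall tau distance between two rankings of the same items."""
    pos1 = {u: i for i, u in enumerate(t1)}
    pos2 = {u: i for i, u in enumerate(t2)}
    items = list(pos1)
    discordant = 0
    for a, b in combinations(items, 2):
        # penalty 1 if the pair (a, b) is ordered differently in the two lists
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            discordant += 1
    max_pairs = len(items) * (len(items) - 1) // 2  # normalization: max distance
    return discordant / max_pairs

print(kendall_distance(["u1", "u2", "u3"], ["u1", "u2", "u3"]))  # 0.0 (identical)
print(kendall_distance(["u1", "u2", "u3"], ["u3", "u2", "u1"]))  # 1.0 (reversed)
```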
41 Coopeer
- Query-based representation
  - A novel type of representation based on relevant words introduced by human users with a high proficiency in their expertise domains.
  - It is efficient on the P2P platform, as the users' evaluations can be utilized easily through the client application.
  - It represents and organizes the local documents for responding to remote queries.
42 Coopeer
- Each peer maintains an inverted index table:
  - it represents local documents for responding to remote queries;
  - it stores the IDs of the documents that were replied to each query;
  - the keys of the inverted index are terms extracted from previous queries.
- Example: peer j issues two queries, "P2P Overlay" and "P2P Routing", and obtains two sets of documents, {d1, d2, d3} and {d3, d4}, respectively.
  - The retrieved documents are indexed under their corresponding query terms.
  - When any other peer issues a query about "Overlay Routing Algorithm", peer j looks up relevant documents in the inverted index using VSM cosine similarity as the ranking algorithm, and d3 gains the highest ranking.
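The example above can be sketched as follows. This is an illustrative simplification, not Coopeer code: documents are scored by raw term overlap with the query (the unnormalized dot product of binary vectors) rather than the full cosine formula, which is enough to reproduce the d3 ranking from the slide:

```python
from collections import defaultdict

# Query-based representation: index documents under the terms of the
# past queries they answered (peer j's two queries from the example).
index = defaultdict(set)  # term -> doc IDs that answered a query with it
for query, docs in [("p2p overlay", {"d1", "d2", "d3"}),
                    ("p2p routing", {"d3", "d4"})]:
    for term in query.split():
        for d in docs:
            index[term].add(d)

def rank(query):
    """Rank local documents by how many query terms they are indexed under."""
    scores = defaultdict(int)
    for t in query.split():
        for d in index.get(t, ()):
            scores[d] += 1  # binary-weight dot product with the query
    return sorted(scores, key=scores.get, reverse=True)

print(rank("overlay routing algorithm")[0])  # 'd3'
```

d3 matches both "overlay" and "routing", while d1, d2, and d4 each match only one term, so d3 ranks highest, as on the slide.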
43 Coopeer
- Semantic routing algorithm
  - Each Coopeer client maintains a local Topic Neighbor Index.
  - The index records the observed performance of remote peers that have topics similar to the local peer's.
  - The queries of these search sessions are used to represent the peers' semantic content:
    - session 1 is the local peer, which has two topics (queries);
    - the other sessions denote the remote peers the local peer is interested in, in some aspect;
    - sessions 2 and 3 are relevant to the local peer's "P2P Routing" topic, while the others are about "Pattern Recognition".
  - The peers on the same topic are kept in descending order of their rate.
  - The peers providing more interesting resources move to the top of an individual's local index.
44 Coopeer
- With the query-based inverted index, the precision of matching results across different subjects was almost 100%.
- The system uses information coming from centralized search engines, so it is not aimed at replacing CSEs, but at complementing them.
45 Coopeer
- Query-based representation is efficient in P2P because users' evaluations can be utilized easily through the client application.
- This is inefficient in CSEs, because gaining user evaluations through a web browser is inefficient, and it is impractical to store and index documents for every user's query.
- Prevents spam, since it is difficult to fake evaluations from human users.
- Uses human searching experience → better results.
46 ODISSEA
- A distributed global indexing and query execution service.
- Maintains a global index structure under document insertions and updates, and node joins and failures.
- The inverted index for a particular term (word) is located at a single node, or partitioned over a small number of nodes in some hybrid organizations.
- Assumes a two-tier architecture.
- The system is implemented on top of an underlying global address space provided by a DHT structure.
47 ODISSEA
- The system provides the lower tier of the two-tier architecture.
- In the upper tier, two classes of clients interact with this P2P-based lower tier:
- Update clients
  - insert new or updated documents into the system, which stores and indexes them.
  - An update client could be a crawler inserting crawled pages, a web server pushing documents into the index, or a node in a file-sharing system.
- Query clients
  - design optimized query execution plans, based on statistics about term frequencies and correlations, and issue them to the lower tier.
48 ODISSEA
(System architecture diagram.)
49 ODISSEA
- Global index
  - An inverted index for a document collection is a data structure that contains, for each word in the collection, a list of all its occurrences, or a list of postings.
  - Each posting contains the document ID of the occurrence of the word, its position inside the document, and other information (in the title? bold face?).
  - Each node holds a complete global postings list for a subset of the words, as determined by a hash function.
50 ODISSEA
- Query processing
  - A ranking function is a function F that, given a query consisting of a set of search terms q0, q1, ..., qm-1, assigns to each document d a score F(d, q0, q1, ..., qm-1).
  - The top-k ranking problem is then the problem of identifying the k documents in the collection with the highest scores.
51 ODISSEA
- We focus on two families of ranking functions:
  - The first family includes the common families of term-based ranking functions used in IR, where we add up the scores of each document with respect to all words in the query.
  - The second formula adds a query-independent value g(d) to the score of each page.
52 ODISSEA
- Fagin's Algorithm
  - Consider the inverted lists for a search query with two terms, q0 and q1.
  - Assume they are located on the same machine, and that the postings in each list are pairs (d, f(d, qi)), i ∈ {0, 1}, where d is an integer identifying the document and f(d, qi) is real-valued.
  - Assume each inverted list is sorted by the second attribute, so that documents with the largest f(d, qi) are at the start of the list.
  - Then the following algorithm, called FA, computes the top-k results.
53 ODISSEA
- FA
  - (1) Scan both lists from the beginning, reading one element from each list in every step, until there are k documents that have each been encountered in both of the lists.
  - (2) Compute the scores of these documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine the score of the document. Return the k documents with the highest scores.
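The two phases can be sketched as follows. This is an illustrative sketch with made-up postings; the lookup dictionaries stand in for FA's random accesses into the lists, and a document absent from a list contributes score 0:

```python
def fagin_top_k(list0, list1, k):
    """FA over two inverted lists sorted by descending score.

    Postings are (doc_id, score) pairs; the score of a document is the
    sum of its scores in the two lists.
    """
    score0, score1 = dict(list0), dict(list1)  # random-access lookup tables
    seen0, seen1 = set(), set()
    # Phase 1: sorted (sequential) access until k docs seen in BOTH lists
    for (d0, _), (d1, _) in zip(list0, list1):
        seen0.add(d0)
        seen1.add(d1)
        if len(seen0 & seen1) >= k:
            break
    # Phase 2: random-access lookups for docs seen in only one list,
    # then return the k highest-scoring candidates
    candidates = seen0 | seen1
    scores = {d: score0.get(d, 0) + score1.get(d, 0) for d in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

l0 = [("a", 0.9), ("b", 0.8), ("c", 0.3)]   # sorted by descending f(d, q0)
l1 = [("b", 0.7), ("a", 0.65), ("d", 0.2)]  # sorted by descending f(d, q1)
print(fagin_top_k(l0, l1, 2))  # ['a', 'b']
```

The point of FA is that phase 1 can stop early: once k documents appear in both sorted lists, no unseen document can beat them, so the deep tails of the lists are never read.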
54 Conclusion
- Still no P2P web search engine has outperformed Google!
- (+) Lots of resources for complex data mining tasks and for crawling the whole surface web.
- (+) The emergence of semantic communities also has a positive impact on P2P web search performance.
- (-) Lack of global knowledge.
- (-) Smart crawling strategies beyond BFS are hard to implement in a P2P environment without a centralized scheduler.
55 Some Open Problems
- How can one uniformly sample web pages on a web site without an exhaustive list of those pages?
  - Bar-Yossef converted the web graph into an undirected, connected, and regular graph.
  - The equilibrium distribution of a random walk on this graph is the uniform distribution.
  - It is not clear how many steps such a walk needs to perform.
  - A more significant problem, however, is that there is no reliable way of converting the web graph into an undirected graph.
56 Some Open Problems
- Data streams
  - The query logs of a web search engine contain all the queries issued at that search engine.
  - The most frequent queries change only slowly over time.
  - However, the queries with the largest increase or decrease from one time period to the next show interesting trends in user interests. We call them the top gainers and losers.
  - Since the number of queries is huge, the top gainers and losers need to be computed by making only one pass over the query logs.
  - This leads to a data stream problem.
  - Another interesting variant is to find all items above a certain frequency whose relative increase (i.e., their increase divided by their frequency in the first sequence) is the largest.
57 References
- S. Brin and L. Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems, 30(1-7), 1998.
- "Make It Fresh, Make It Quick: Searching a Network of Personal Webservers." International World Wide Web Conference, Budapest, Hungary, 2003.
- J. Zhou, K. Li, and L. Tang. "Towards a Fully Distributed P2P Web Search Engine." Proceedings of the 10th IEEE International Workshop on Future Trends of Distributed Computing Systems, 2004.
- T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, K. Shanmugasundaram. "ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval." 2003.
- B. Bloom. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM, 13(7):422-426, 1970.
- en.wikipedia.org
58 Extra Slides
59 Bloom Filters
- A space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
- False positives are possible, but false negatives are not.
- The more elements that are added to the set, the larger the probability of false positives.
60 Bloom Filters
- An empty Bloom filter is a bit array of m bits, all set to 0.
- There must also be k different hash functions defined, each of which maps a key value to one of the m array positions.
- To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
- To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions.
  - If any of the bits at these positions is 0, the element is not in the set; if it were, then all the bits would have been set to 1 when it was inserted.
  - If all are 1, then either the element is in the set, or the bits were set to 1 during the insertion of other elements.
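The add/query procedure above can be sketched in a few lines. A minimal illustration, with salted SHA-256 standing in for the k independent hash functions and m chosen arbitrarily small:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array with k hash functions."""

    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # k array positions for an item, one per (salted) hash function
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # any 0 bit -> definitely absent; all 1s -> probably present
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for word in ["x", "y", "z"]:
    bf.add(word)
print(bf.might_contain("x"))  # True (no false negatives)
print(bf.might_contain("w"))  # probably False, but a false positive is possible
```

This is the trade-off the slides describe: membership tests in a fixed, small amount of space, at the cost of an occasional false positive.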
61 Bloom Filters
(Figure: an example of a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w, not in the set, is detected as a non-member because it maps to a position containing a 0.)