Transcript and Presenter's Notes

Title: WEB SEARCH and P2P

1
WEB SEARCH and P2P
  • Advisor: Dr. Sushil Prasad
  • Presented By: DM Rasanjalee Himali

2
OUTLINE
  • Introduction to web search engines
  • What is a web search engine?
  • Web Search engine architecture
  • How does a web search engine work?
  • Relevance and Ranking
  • Limitations in current Web Search Engines
  • P2P Web Search Engines
  • YouSearch
  • Coopeer
  • ODISSEA
  • Conclusion

3
What is a web search engine?
  • A Web search engine is a search engine designed
    to search for information on the World Wide Web.
  • Information may consist of web pages, images and
    other types of files.
  • Some search engines also mine data available in
    newsgroups, databases, or open directories

4
History

Company Millions of searches Relative market share (%)
Google 28,454 46.47
Yahoo! 10,505 17.16
Baidu 8,428 13.76
Microsoft 7,880 12.87
NHN 2,882 4.71
eBay 2,428 3.9
Time Warner 1,062 1.6
Ask.com 728 1.1
Yandex 566 0.9
Alibaba.com 531 0.8
Total 61,221 100.0
  • Before there were search engines, there was a
    complete list of all web servers.
  • The very first tool used for searching on the
    Internet was Archie:
  • it downloaded directory listings of files on FTP
    sites
  • it did not index the contents of these sites
  • Soon after, many search engines appeared:
  • Excite, Infoseek, Northern Light, AltaVista,
    Yahoo!, Google, MSN Search

5
How Web Search Engine Work
  • A search engine operates in the following order:
  • Web crawling
  • Indexing
  • Searching

6
Web Crawling
  • A web crawler:
  • a program which browses the World Wide Web in a
    methodical, automated manner
  • a means of providing up-to-date data
  • creates a copy of all visited pages for later
    processing by the search engine
  • starts with a list of URLs to visit, called the
    seeds.
  • As the crawler visits these URLs, it identifies
    all the hyperlinks in the page and adds them to
    the list of URLs to visit, called the crawl
    frontier.
  • URLs from the frontier are recursively visited
    according to a set of policies.
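
A minimal sketch of this crawl loop in Python (the HTTP and HTML-parsing libraries, the page limit, and the omitted politeness policies are illustrative assumptions):

from collections import deque
from urllib.parse import urljoin
import requests                 # assumed available HTTP client
from bs4 import BeautifulSoup   # assumed available HTML parser

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)     # the crawl frontier, seeded with start URLs
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url, timeout=10).text
        # a real engine would store a copy of the page here for indexing
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            frontier.append(urljoin(url, a["href"]))   # grow the frontier
    return visited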

7
Robot Exclusion Protocol
  • also known as the robots.txt protocol
  • is a convention to prevent cooperating web robots
    from accessing all or part of a website which is
    otherwise publicly viewable.
  • User-agent: *
  • Disallow: /cgi-bin/
  • Disallow: /images/
  • Disallow: /tmp/
  • Disallow: /private/
  • Sitemap: http://www.example.com/sitemap.xml.gz
  • Crawl-delay: 10
  • Allow: /folder1/myfile.html
  • Request-rate: 1/5 (maximum rate is one page
    every 5 seconds)
  • Visit-time: 0600-0845 (only visit between
    06:00 and 08:45 UTC (GMT))
  • It relies on the cooperation of the web robot, so
    that marking an area of a site out of bounds with
    robots.txt does not guarantee privacy.
  • The standard complements Sitemaps, a robot
    inclusion standard for websites.
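
A cooperating robot can check these rules before fetching a page. A minimal sketch using Python's standard-library parser, assuming the example directives above:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the file
print(rp.can_fetch("*", "http://www.example.com/cgi-bin/env"))          # False: disallowed
print(rp.can_fetch("*", "http://www.example.com/folder1/myfile.html"))  # True: explicitly allowed
print(rp.crawl_delay("*"))   # 10, from the Crawl-delay directive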

8
SiteMap Protocol
  • allows a webmaster to inform search engines about
    URLs on a website that are available for
    crawling.
  • A Sitemap is an XML file that lists the URLs for
    a site.
  • It allows webmasters to include additional
    information about each URL
  • when it was last updated,
  • how often it changes, and
  • how important it is in relation to other URLs in
    the site.
  • This allows search engines to crawl the site more
    intelligently.
  • Sitemaps are a URL inclusion protocol and
    complement robots.txt, a URL exclusion protocol

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
9
Distributed Web Crawling
  • Internet search engines employ many computers to
    index the Internet via web crawling.
  • Dynamic assignment:
  • a central server assigns new URLs to different
    crawlers dynamically
  • allows the central server to dynamically balance
    the load of each crawler
  • Static assignment:
  • there is a fixed rule, stated from the beginning
    of the crawl, that defines how to assign new URLs
    to the crawlers (a minimal sketch follows below)
  • Google uses thousands of individual computers in
    multiple locations to crawl the Web.
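
For illustration, a minimal sketch of one possible static assignment rule (the hash choice and crawler count are assumptions); hashing by host name sends all URLs of a site to one crawler, which also localizes politeness decisions:

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4   # illustrative

def assign_crawler(url):
    # fixed rule known to every crawler from the start: no central server needed
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_CRAWLERS

print(assign_crawler("http://www.example.com/page.html"))   # a value in 0..3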

10
Indexing
  • The purpose of storing an index is to optimize
    speed and performance in finding relevant
    documents for a search query
  • Search engine indexing collects, parses, and
    stores data to facilitate fast and accurate
    information retrieval.
  • The contents of each page are analyzed to
    determine how it should be indexed
  • Ex: words are extracted from the titles,
    headings, or special fields called meta tags
  • Meta search engines reuse the indices of other
    services and do not store a local index

11
Challenges in Parallelism
  • A major challenge in the design of search engines
    is the management of parallel computing
    processes.
  • There are many opportunities for race conditions
    and coherence faults.
  • Ex: a new document is added to the corpus and the
    index must be updated, but the index
    simultaneously needs to continue responding to
    search queries.
  • the search engine's architecture may involve
    distributed computing, where the search engine
    consists of several machines operating in unison.
  • This increases the possibilities for incoherency
    and makes it more difficult to maintain a
    fully-synchronized, distributed, parallel
    architecture.

12
Inverted Indices
  • inverted index stores a list of the documents
    containing each word
  • search engine can use direct access to find the
    documents associated with each word in the query
    to retrieve the matching documents quickly

Word Documents
the Document 1, Document 3, Document 4, Document 5
cow Document 2, Document 3, Document 4
says Document 5
moo Document 7
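A minimal sketch of building and querying such an index (the document contents are made up for illustration):

from collections import defaultdict

docs = {
    "Document 2": "the cow is brown",
    "Document 5": "the cow says moo",
}

index = defaultdict(set)   # word -> set of documents containing it
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(index["cow"]))   # direct access: ['Document 2', 'Document 5']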
13
Searching
  • web search query:
  • a query that a user enters into a web search
    engine to satisfy his or her information needs
  • is distinctive in that it is unstructured and
    often ambiguous
  • varies greatly from standard query languages,
    which are governed by strict syntax rules

14
Searching
  • Three broad categories that cover most web search
    queries
  • Informational queries
  • Queries that cover a broad topic (e.g.,
    "colorado" or "trucks") for which there may be
    thousands of relevant results.
  • Navigational queries
  • Queries that seek a single website or web page of
    a single entity (e.g., "youtube" or "delta
    airlines").
  • Transactional queries
  • Queries that reflect the intent of the user to
    perform a particular action, like purchasing a
    car or downloading a screen saver.

15
Web search engine architecture
[Figure: web search engine architecture, after "The
Anatomy of a Large-Scale Hypertextual Web Search
Engine" by Sergey Brin and Lawrence Page. Crawlers
fetch pages from a URL list, and the pages are
compressed and stored in a repository. The indexer
reads the repository, uncompresses and parses the
documents into hit lists, distributes the hits into
barrels partially sorted by docID, and parses out
links into an anchors file. A URL resolver reads the
anchors file, converts relative URLs into absolute
URLs and then docIDs, and associates anchor text
with the docIDs it points to. A sorter resorts the
barrels by wordID to produce the inverted index,
using the lexicon; PageRank (PR) is calculated for
all documents, and the searcher uses the lexicon,
inverted index, and PR to answer queries.]
16
Important Properties Of Commercial Web Search
  • To be successful, a commercial search engine must
    address all of these issues/properties:
  • Millions of heterogeneous users
  • Goal is to make money
  • UI is extremely important
  • Real-time/fast expectation
  • Content of web page not sufficient to imply
    meaning
  • Result ranking cannot assume independence
  • Must consider maliciousness
  • No quality control on pages (quality varies)
  • Web is large (practically infinite)

17
Relevance and Ranking
  • Exactly how a particular search engine's
    algorithm works is a closely-kept trade secret.
  • However, all major search engines follow the
    general rules below.
  • Location, Location, Location...and Frequency
  • Location
  • Search engines will also check to see if the
    search keywords appear near the top of a web
    page, such as in the headline or in the first few
    paragraphs of text. They assume that any page
    relevant to the topic will mention those words
    right from the beginning.
  • Frequency
  • A search engine will analyze how often keywords
    appear in relation to other words in a web page.
    Those with a higher frequency are often deemed
    more relevant than other web pages.

18
Precision and Recall
  • two widely used measures for evaluating the
    quality of results in Information Retrieval
  • Precision:
  • the fraction of the documents retrieved that are
    relevant to the user's information need
  • Precision = (number of relevant documents
    retrieved by a search) / (total number of
    documents retrieved by that search)
  • Recall:
  • the fraction of the documents relevant to the
    query that are successfully retrieved
  • Recall = (number of relevant documents retrieved
    by a search) / (total number of existing relevant
    documents that should have been retrieved)
  • Often, there is an inverse relationship between
    Precision and Recall
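
A small worked example of the two measures (the document IDs are illustrative):

def precision(retrieved, relevant):
    # relevant documents retrieved / total documents retrieved
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # relevant documents retrieved / total existing relevant documents
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}   # what the search returned
relevant = {"d1", "d4", "d7"}          # what should have been returned
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 = 0.67 (approximately)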

19
Relevance and Ranking
  • Webmasters constantly rewrite their web pages in
    an attempt to gain better rankings.
  • Some sophisticated webmasters may even "reverse
    engineer" the location/frequency systems used by
    a particular search engine.
  • Because of this, all major search engines now
    also make use of "off the page" ranking criteria.

20
Relevance and Ranking
  • Off the page factors:
  • those that a webmaster cannot easily influence
  • Link analysis:
  • the search engine analyzes how pages link to each
    other
  • helps to determine what a page is about and
    whether that page is "important" and thus
    deserving of a ranking boost
  • Click-through measurement:
  • a search engine watches which results someone
    selects for a particular search,
  • eventually drops high-ranking pages that aren't
    attracting clicks,
  • and promotes lower-ranking pages that do pull in
    visitors.

21
Limitations in current web search engines
  • Centralized search engines have limited
    scalability.
  • Crawler-based indices are stale and incomplete.
  • Fundamental issue: how much of the web is
    crawlable?
  • If you follow the rules, many sites say "robots
    get lost".
  • What about dynamic content? (Deep Web)
  • The deep web is around 500 times larger than the
    surface web. These deep web resources mainly
    include data held by databases which can be
    accessed only through queries. Since crawlers
    discover resources only through links, they
    cannot discover these resources.
  • There's no guarantee that current search engines
    index or even crawl the total surface web space.

22
Limitations in current web search engines
  • Single point of failure
  • Ambiguous words:
  • Polysemy - words with multiple meanings ("train"
    as in "train car" vs. "train a neural network")
  • Synonymy - multiple words with the same meaning
    ("the neural network is trained as follows" vs.
    "the neural network learns as follows")
  • What about phrases? Searches are not bags of
    words.
  • Positional information? Structural information
    (throw out case, punctuation)?
  • Non-text content: is the data worth storing?
  • Most web search engines today crawl only the
    surface web.

23
P2P Web Search
  • The last few years have seen an explosion of
    activity in the area of peer-to-peer (P2P)
    systems.
  • Since an increasing amount of content now resides
    in P2P networks, it becomes necessary to provide
    search facilities within P2P networks.
  • The significant computing resources provided by a
    P2P system could also be used to implement search
    and data mining functions for content located
    outside the system
  • e.g., for search and mining tasks across large
    intranets or global enterprises, or even to build
    a P2P-based alternative to the current major
    search engines.

24
P2P Web Search
  • The characteristics that distinguish P2P systems
    from previous technologies:
  • low maintenance overhead
  • improved scalability
  • improved reliability
  • synergistic performance
  • increased autonomy and privacy
  • dynamism

25
P2P Web Search Engines
  • YouSearch
  • Coopeer
  • ODISSEA

26
YouSearch
  • YouSearch:
  • is a distributed search application for personal
    webservers operating within a shared context
  • allows peers to aggregate into groups and users
    to search over specific groups
  • Goal:
  • provide fast, fresh and complete results to users

27
YouSearch
  • System Overview
  • Participants in YouSearch:
  • Peer-nodes:
  • run YouSearch-enabled clients
  • Browsers:
  • search YouSearch-enabled content through their
    web browsers
  • Registrar:
  • a centralized lightweight service that acts like
    a blackboard on which peer nodes store and look
    up (summarized) network state

28
YouSearch
  • System Overview
  • Search System
  • Each peer node closely monitors its own content
    to maintain a fresh local index.
  • A bloom filter content summary is created by each
    peer and pushed to the registrar.
  • When a browser issues a search query at a peer p,
    the peer p first queries the summaries at the
    registrar to obtain a set of peers R in the
    network that are hosting relevant documents.
  • The peers in R are then directly contacted with
    the query to obtain the URLs for the results.
  • To quickly satisfy any subsequently issued
    queries with identical terms, the results from
    each query issued at a peer p are cached for a
    limited time at p.

29
YouSearch
  • Indexing
  • Indexing is periodically executed at every peer
    node.
  • Inspector examines each shared file for its last
    modification date and time.
  • If the file is new or the file has changed, the
    file is passed to the Indexer.
  • The Indexer maintains a disk-based inverted-index
    over the shared content.
  • The name and path information of the file are
    indexed as well.

30
YouSearch
  • Indexing
  • Summarizer
  • The Summarizer obtains a list of terms T from the
    Indexer and creates a bloom filter from them in
    the following way:
  • A bit vector V of length L is created with each
    bit set to 0.
  • A specified hash function H with range {1, ..., L}
    is used to hash each term t in T, and the bit at
    position H(t) in V is set to 1.
  • YouSearch uses k independent hash functions
    H1, H2, ..., Hk and constructs k different bloom
    filters, one for each hash function.
  • In YouSearch:
  • the length of each bloom filter is L = 64 Kbits,
    and
  • the number of bloom filters k is set to 3.
  • The Summary Manager at the registrar aggregates
    these bloom filters into a structure that maps
    each bit position to the set of peers whose bloom
    filters have the corresponding bit set (see the
    sketch below).
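
A sketch of the summary scheme with the stated parameters; the salted-hash construction of the k independent hash functions is an illustrative assumption, not YouSearch's actual choice:

import hashlib
from collections import defaultdict

L = 64 * 1024   # bits per bloom filter
K = 3           # number of independent filters

def bit_position(term, i):
    # simulate the i-th independent hash function by salting one digest
    return int(hashlib.sha1(f"{i}:{term}".encode()).hexdigest(), 16) % L

def summarize(terms):
    # peer side: the set of set-bit positions for each of the K filters
    filters = [set() for _ in range(K)]
    for t in terms:
        for i in range(K):
            filters[i].add(bit_position(t, i))
    return filters

def aggregate(summaries):
    # registrar side: map each (filter, bit position) to the peers that set it
    mapping = [defaultdict(set) for _ in range(K)]
    for peer, filters in summaries.items():
        for i in range(K):
            for bit in filters[i]:
                mapping[i][bit].add(peer)
    return mapping

mapping = aggregate({"peerA": summarize(["p2p", "search"]),
                     "peerB": summarize(["search", "music"])})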

31
YouSearch
  • Querying

[Figure: query flow. A browser issues a query to a
peer; the peer computes the hash of the query
keywords and looks up the registrar's bit-position-
to-IP-address mapping for the corresponding bits of
each of the k bloom filters; the registrar
determines the intersection of the peer lists; the
querying peer then contacts each of the peers in
that list and obtains a list of URLs for matching
documents, which are returned as the results.]
32
YouSearch
  • Caching
  • Every time a global query is answered that
    returns non-zero results, the querying peer
    caches the result set of URLs U (temporarily).
  • The peer then informs the registrar of this fact.
  • The registrar adds a mapping from the query to
    the IP address of the caching peer in its cache
    table.

33
YouSearch
  • Limitations
  • False positive results: 17.38%
  • Central registrar → single point of failure
  • No extensive phrase search
  • No attention has been given to query ranking
  • No human user collaboration

34
Coopeer
  • Coopeer:
  • is a P2P web search engine where each user's
    computer stores a part of the web model used for
    indexing and retrieving web resources in response
    to queries
  • Goal:
  • complement centralized search engines by
    providing more humanized and personalized results
    that utilize users' collaboration

35
Coopeer
  • (a) Collaboration
  • One may look for interesting web pages in the P2P
    knowledge repository consisting of shared web
    pages.
  • A novel collaborative filtering technique called
    PeerRank is presented to rank pages in proportion
    to the votes from relevant peers.
  • (b) Humanization
  • Coopeer uses a query-based representation for
    documents:
  • the relevant words are not directly extracted
    from page content but introduced by human users
    with a high proficiency in their expertise
    domains.
  • (c) Personalization
  • Similar users are self-organized according to the
    semantic content of their search sessions.
  • Thus, a requestor peer can extend routing paths
    along its neighbors, rather than just taking a
    blind shot.
  • User-customized results can be obtained along
    personal routing paths, in contrast with CSEs.

36
Coopeer
  • System Overview
  • The requestor forwards the query based on
    semantic routing.
  • Peers maintain a local index about the semantic
    content of remote peers.
  • On receiving a query message from a remote peer,
    the current peer checks it against its local
    store.
  • To facilitate this, a novel query-based
    representation of documents is introduced.
  • Based on the query representation, the cosine
    similarity between the new query and documents
    can be computed.
  • Documents are considered relevant if the
    similarity exceeds a certain threshold.
  • These results are then returned to the requestor.
  • On receiving the returned results, the requestor
    peer ranks them in terms of the preferences of
    its human owner, using the PeerRank method.

37
Coopeer
  • The Coopeer client consists of four main software
    agents:
  • The User Agent
  • is responsible for interacting with the users.
  • It provides a friendly user interface, so that
    users can conveniently manage and manipulate
    whole search sessions.
  • The Web-searcher Agent
  • is the resource of the P2P knowledge repository.
  • It performs the user's individual searching with
    several search engines from the Internet.
  • The Collaborator Agent
  • is the key component for performing users'
    real-time collaborative searching.
  • It facilitates maintaining the P2P knowledge
    repository, through information sharing,
    searching, and fusion.
  • The Manager Agent
  • is the key component of Coopeer; it coordinates
    and manages the other types of agents.
  • It is also responsible for updating and
    maintaining data.

38
Coopeer
  • PeerRank
  • All the users are taken as a Referrer Network.
  • Determines page relevance by examining a
    radiating network of referrers.
  • Documents with more referrers gain higher ranks.
  • Obtains a better rank order, as collaborative
    evaluation by human users is much more precise
    than descriptions based on term frequency or
    link counts.
  • Prevents spam, since it is difficult to fake
    evaluations from human users.

39
Coopeer
  • PeerRank
  • For a given search session, we first compute the
    similarity between the requestor's favorite list
    and the referrers' lists;
  • the similarity is then used as the baseline of
    the recommending degree of the referrer.
  • First, as shown in equation (1), the similarity
    of the local list and a recommended list is given
    by the Kendall measure.
  • Second, we convert the rank of a given URL in its
    recommended list to a moderate score.
  • R(e) - weight of URL e
  • C(e) - set constituted by e's referrers
  • Z - constant > 1
  • p - local peer
  • Pi - a remote peer
  • Lp, LPi - lists of p and Pi respectively
  • K(r)(Lp, LPi) - Kendall function measuring the
    distance between the local list and the
    recommended list
  • r - decay factor
  • SLPi(e) - score of e in the recommended list
  • Re - rank of e
  • RMax - highest rank of list Pi, i.e. the length
    of the list

40
Coopeer
  • Kendall Measure
  • Kendall is used to measure the distance between
    two lists of the same length.
  • The paper extends it to measure two lists of
    different lengths.
  • Kendall function:
  • t1 and t2 - two lists composed of URLs
  • Kr(t1, t2) - the distance between t1 and t2
  • r - fixed parameter with 0 ≤ r ≤ 1
  • C(2L, 2) - used for normalization; it is the
    maximum possible value of the distance
  • U(t1, t2) - set consisting of all the URLs in t1
    and t2
  • Kr i,j(t1, t2) - the penalty of the URL pair
    (i, j)

41
Coopeer
  • Query Based Representation
  • A novel type of representation based on the
    relevant words introduced by human users with a
    high proficiency in their expertise domains.
  • It is efficient on the P2P platform, as the
    users' evaluations can be utilized easily through
    the client application.
  • It represents and organizes the local documents
    for responding to remote queries.

42
Coopeer
  • Each peer maintains an inverted index table that:
  • represents local documents for responding to
    remote queries
  • stores the IDs of the documents that were
    returned in reply to each query
  • the keys of the inverted index are terms
    extracted from previous queries
  • Ex: peer j issues two queries, "P2P Overlay" and
    "P2P Routing", and obtains two sets of documents,
    {d1, d2, d3} and {d3, d4} respectively.
  • The retrieved documents are indexed under their
    corresponding query terms.
  • When any other peer issues a query about "Overlay
    Routing Algorithm", peer j looks up relevant
    documents in the inverted index using VSM cosine
    similarity as the ranking algorithm, and d3 gains
    the highest ranking (see the sketch below).
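
A sketch of this example; computing cosine similarity over raw term counts is an illustrative simplification of the VSM ranking:

import math
from collections import Counter

# peer j's query-based representation, built from the two past queries
doc_terms = {
    "d1": ["p2p", "overlay"],
    "d2": ["p2p", "overlay"],
    "d3": ["p2p", "overlay", "p2p", "routing"],   # answered both queries
    "d4": ["p2p", "routing"],
}

def cosine(a, b):
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

query = ["overlay", "routing", "algorithm"]
for d in sorted(doc_terms):
    print(d, round(cosine(query, doc_terms[d]), 3))   # d3 scores highest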

43
Coopeer
  • Semantic Routing Algorithm
  • Each Coopeer client maintains a local Topic
    Neighbor Index.
  • The index records the observed performance of
    remote peers that have topics similar to the
    local peer's.
  • These search sessions' queries are used to
    represent the peers' semantic content:
  • session 1 → the local peer, which has two topics
    (queries);
  • the other sessions denote remote peers that the
    local peer is interested in, in some respect.
  • Sessions 2 and 3 are relevant to the local peer's
    "P2P Routing" topic, while the others are about
    "Pattern Recognition".
  • The peers on the same topic are kept in
    descending order of their rate.
  • Peers providing more interesting resources move
    toward the top of an individual's local index.

44
Coopeer
  • With the query-based inverted index, the
    precision of matching results of different
    subjects was almost 100%.
  • The system uses information coming from
    centralized search engines, so it is not aimed at
    replacing CSEs, but at complementing them.

45
Coopeer
  • Query-based representation is efficient in P2P
    because users' evaluations can be utilized easily
    through the client application.
  • It is inefficient in CSEs because gaining user
    evaluations through the web browser is
    inefficient, and it is impractical to store and
    index documents for every user's query.
  • Prevents spam, since it is difficult to fake
    evaluations from human users.
  • Uses human searching experience → better results

46
ODISSEA
  • A distributed global indexing and query execution
    service.
  • Maintains a global index structure under document
    insertions and updates, and node joins and
    failures.
  • The inverted index for a particular term (word)
    is located at a single node, or partitioned over
    a small number of nodes in some hybrid
    organizations.
  • Assumes a two-tier architecture.
  • The system is implemented on top of an underlying
    global address space provided by a DHT structure.

47
ODISSEA
  • The system provides the lower tier of the
    two-tier architecture.
  • In the upper tier, there are two classes of
    clients that interact with this P2P-based lower
    tier:
  • Update clients
  • insert new or updated documents into the system,
    which stores and indexes them.
  • An update client could be a crawler inserting
    crawled pages, a web server pushing documents
    into the index, or a node in a file sharing
    system.
  • Query clients
  • design optimized query execution plans, based on
    statistics about term frequencies and
    correlations, and issue them to the lower tier.

48
ODISSEA
49
ODISSEA
  • Global Index
  • An inverted index for a document collection is a
    data structure that contains, for each word in
    the collection, a list of all its occurrences, or
    a list of postings.
  • Each posting contains the document ID of the
    occurrence of the word, its position inside the
    document, and other information (is the word in
    the title? in bold face?).
  • Each node holds a complete global postings list
    for a subset of the words, as determined by a
    hash function (see the sketch below).
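
A minimal sketch of this term-to-node mapping (the hash and node count are illustrative; in ODISSEA the mapping is provided by the underlying DHT):

import hashlib

NUM_NODES = 16   # illustrative

def home_node(term):
    # the node responsible for the complete global postings list of this term
    return int(hashlib.sha1(term.encode()).hexdigest(), 16) % NUM_NODES

print(home_node("peer"))   # every posting for "peer" is stored at this node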

50
ODISSEA
  • Query Processing
  • A ranking function is a function F that, given a
    query consisting of a set of search terms
    q0, q1, ..., qm-1, assigns to each document d a
    score F(d, q0, q1, ..., qm-1). The top-k ranking
    problem is then the problem of identifying the k
    documents in the collection with the highest
    scores.

51
ODISSEA
  • We focus on two families of ranking functions.
  • The first family includes the common term-based
    ranking functions used in IR, where the score of
    a document is the sum of its scores with respect
    to each word in the query:
    F(d, q0, ..., qm-1) = f(d, q0) + ... + f(d, qm-1).
  • The second formula adds a query-independent value
    g(d) (such as a global importance measure) to the
    score of each page:
    F(d, q0, ..., qm-1) = g(d) + f(d, q0) + ... + f(d, qm-1).

52
ODISSEA
  • Fagin's Algorithm
  • Consider the inverted lists for a search query
    with two terms q0 and q1.
  • Assume they are located on the same machine, and
    that the postings in the lists are pairs
    (d, f(d, qi)), i ∈ {0, 1}, where d is an integer
    identifying the document and f(d, qi) is
    real-valued.
  • Assume each inverted list is sorted by the second
    attribute, so that documents with the largest
    f(d, qi) are at the start of the list.
  • Then the following algorithm, called FA, computes
    the top-k results:

53
ODISSEA
  • FA
  • (1) Scan both lists from the beginning, reading
    one element from each list in every step, until
    there are k documents that have each been
    encountered in both of the lists.
  • (2) Compute the scores of these documents. Also,
    for each document that was encountered in only
    one of the lists, perform a lookup into the other
    list to determine the score of the document.
    Return the k documents with the highest scores.
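
A sketch of FA for two terms under the assumptions above; each list is a sequence of (document, score) pairs sorted by descending score, and random access into the full lists is simulated with dictionaries:

def fagin_top_k(list0, list1, k):
    seen0, seen1, in_both = {}, {}, set()
    # (1) sorted access in lockstep until k documents appear in both lists
    for (d0, s0), (d1, s1) in zip(list0, list1):
        seen0[d0], seen1[d1] = s0, s1
        for d in (d0, d1):
            if d in seen0 and d in seen1:
                in_both.add(d)
        if len(in_both) >= k:
            break
    # (2) random-access lookups for documents seen in only one list
    lookup0, lookup1 = dict(list0), dict(list1)
    candidates = set(seen0) | set(seen1)
    scores = {d: lookup0.get(d, 0.0) + lookup1.get(d, 0.0) for d in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

list_q0 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
list_q1 = [("b", 0.7), ("a", 0.65), ("c", 0.5)]
print(fagin_top_k(list_q0, list_q1, k=2))   # ['a', 'b']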

54
Conclusion
  • Still no P2P web search engine has outperformed
    Google!
  • (+) Lots of resources for complex data mining
    tasks and for crawling the whole surface web
  • (+) The emergence of semantic communities also
    has a positive impact on P2P web search
    performance
  • (-) Lack of global knowledge
  • (-) Smart crawling strategies beyond BFS are hard
    to implement in a P2P environment without a
    centralized scheduler

55
Some Open Problems
  • How to uniformly sample web pages on a web site
    if one does not have an exhaustive list of these
    pages?
  • Bar-Yossef converted the web graph into an
    undirected, connected, and regular graph.
  • The equilibrium distribution of a random walk on
    this graph is the uniform distribution.
  • It is not clear how many steps such a walk needs
    to perform.
  • A more significant problem, however, is that
    there is no reliable way of converting the web
    graph into an undirected graph.

56
Some Open Problems
  • Data Streams
  • The query logs of a web search engine contain all
    the queries issued at this search engine.
  • The most frequent queries change only slowly over
    time.
  • However, the queries with the largest increase or
    decrease from one time period over the next show
    interesting trends in user interests. We call
    them the top gainers and losers.
  • Since the number of queries is huge, the top
    gainers and losers need to be computed by making
    only one pass over the query logs.
  • This leads to a natural data stream problem.
  • Another interesting variant is to find all items
    above a certain frequency whose relative increase
    (i.e., their increase divided by their frequency
    in the first sequence) is the largest.

57
References
  • The Anatomy of a Large-Scale Hypertextual Web
    Search Engine. Sergey Brin and Lawrence Page.
    Computer Networks and ISDN Systems, Volume 30,
    Issue 1-7, 1998.
  • Make It Fresh, Make It Quick: Searching a Network
    of Personal Webservers. International World Wide
    Web Conference, Budapest, Hungary, 2003.
  • Towards a Fully Distributed P2P Web Search
    Engine. Jin Zhou, Kai Li and Li Tang. Proceedings
    of the 10th IEEE International Workshop on Future
    Trends of Distributed Computing Systems, 2004.
  • ODISSEA: A Peer-to-Peer Architecture for Scalable
    Web Search and Information Retrieval. T. Suel,
    C. Mathur, J. Wu, J. Zhang, A. Delis,
    M. Kharrazi, X. Long and K. Shanmugasundaram,
    2003.
  • Space/Time Trade-offs in Hash Coding with
    Allowable Errors. B. Bloom. Communications of the
    ACM, Volume 13(7), pages 422-426, 1970.
  • en.wikipedia.org

58
Extra Slides
59
Bloom Filters
  • a space-efficient probabilistic data structure
    that is used to test whether an element is a
    member of a set.
  • False positives are possible, but false negatives
    are not.
  • The more elements that are added to the set, the
    larger the probability of false positives.

60
Bloom Filters
  • An empty Bloom filter is a bit array of m bits,
    all set to 0.
  • There must also be k different hash functions
    defined, each of which maps a key value to one of
    the m array positions.
  • To add an element, feed it to each of the k hash
    functions to get k array positions. Set the bits
    at all these positions to 1.
  • To query for an element (test whether it is in
    the set), feed it to each of the k hash functions
    to get k array positions.
  • If any of the bits at these positions is 0, the
    element is not in the set; if it were, then all
    the bits would have been set to 1 when it was
    inserted.
  • If all are 1, then either the element is in the
    set, or the bits have been set to 1 during the
    insertion of other elements.
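
A minimal sketch of the add and query operations just described (the parameters m and k and the salted-hash construction are illustrative):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # m "bits", all 0 (one byte each for clarity)

    def _positions(self, item):
        # k hash functions, simulated by salting a single digest
        for i in range(self.k):
            yield int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

    def add(self, item):
        for p in self._positions(item):   # set the k positions to 1
            self.bits[p] = 1

    def __contains__(self, item):
        # all 1s: possibly in the set (false positives possible);
        # any 0: definitely not in the set (no false negatives)
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for x in ("x", "y", "z"):
    bf.add(x)
print("x" in bf, "w" in bf)   # True False (w almost surely rejected)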

61
Bloom Filters
An example of a Bloom filter representing the set
{x, y, z}. The colored arrows show the positions in
the bit array that each set element is mapped to.
The element w, not in the set, is detected as a
nonmember because it is mapped to a position
containing a 0.
62
Bloom Filters