Collaborative Search - PowerPoint PPT Presentation

1
Collaborative Search
  • Zheng Zhen

2
  • Traditional IR
  • Web search
  • Crawlers
  • parallel crawler
  • intelligent crawler
  • Collaborative Search
  • References

3
Traditional IR

[Figure: the traditional IR process]
  • System: Acquisition (documents, objects) → Representation (indexing, ...) → Database of indexed documents
  • User: Problem (information need) → Representation (question) → Query (search formulation)
  • Matching (searching) → Retrieved objects → Feedback
4
Classic Information Retrieval
  • Homogeneous documents
  • Well categorized
  • Small well-controlled collection
  • Closed, static environment
  • Controlled collection growth

5
Web Search
  • Web
  • - open, dynamic environment
  • - vast, uncontrolled collection of pages
  • Web page
  • - heterogeneous: various formats, languages
  • - content may change over time!
  • Importance of links
  • Existing search facilities
  • - Generic: Yahoo, Ask Jeeves, Google, etc.
  • - Specialized: Pluribus, Collaborative Spider

6
Common operations
  • Indexing
  • - identifies potential index terms in documents
  • Query processing
  • - form keywords
  • Search
  • - access indexed file
  • Ranking

7
Ranking
  • Ranking is important
  • Factors which influence rank
  • Term location or frequency
  • Proximity to query terms
  • Date of Publication
  • Length
  • Popularity
  • Heuristics: proper nouns may have higher weights
  • WWW: link analysis, popularity (e.g., Google)

8
The Web indexing
  • Web pages are heterogeneous documents
  • Contain both text information and meta
    information
  • External meta information can be inferred
  • Must be processed before the pertinence can be
    established

9
Indexing WWW documents
  • Web pages require Preprocessing to get uniform
    data structure
  • - Normalizes the document stream to a predefined
    format
  • - Breaks the document stream into desired
    retrievable units
  • - Isolates and metatags subdocument pieces

[Figure: page 1 to page n from Web 1 to Web n pass through preprocessing into a uniform format]
10
Computing weights
  • Assign a weight to each descriptor of the document and add it to the index
  • Weights are based on
  • - term frequency within the document (tf)
  • - global term frequency within the corpus
  • The global statistic becomes a problem when parallel, independent agents do the indexing
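
The weighting described above can be sketched as follows. This is only an illustration: it assumes the common tf-idf scheme (the slide names term frequency and a global corpus statistic, not a specific formula), and `tf_idf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, using tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    # Global statistic: in how many documents of the corpus each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local statistic: counts within this document
        weights.append({t: (tf[t] / len(tokens)) * math.log(n_docs / df[t])
                        for t in tf})
    return weights

docs = ["web crawler web search", "parallel crawler", "web cache"]
weights = tf_idf_weights(docs)
```

Because `df` is computed over the whole corpus, agents that each index only a part of it would see different values, which is the problem the slide points out.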

11
IR on Web

[Figure: components of IR on the Web: Web, Crawlers, Web pages, Document Processor, Indexed files, Query, Query Processor, Search/match, Page ranking, Responses, Browse]
12
Web Document discovery
  • Corpus is very large
  • Dynamic
  • Open
  • Documents must be discovered
  • → use a Web crawler

13
Web Crawler
  • What is a crawler?
  • - init: start from the initial URLs
  • - get next URL: taken from the scheduled URLs
  • - get page: fetched from the Web, recorded in the visited URLs
  • - extract URLs: from the downloaded web pages
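
The loop on this slide can be sketched in a few lines. The sketch assumes a breadth-first schedule and uses a hypothetical in-memory `TOY_WEB` instead of real HTTP fetches; `crawl` and `fetch_links` are illustrative names.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}

def crawl(initial_urls, fetch_links, max_pages=100):
    scheduled = deque(initial_urls)    # init: scheduled URLs start as the seeds
    seen = set(initial_urls)           # URLs already scheduled or visited
    visited = []                       # visited URLs, in download order
    while scheduled and len(visited) < max_pages:
        url = scheduled.popleft()      # get next URL
        links = fetch_links(url)       # get page and extract its URLs
        visited.append(url)
        for link in links:
            if link not in seen:       # schedule each new URL only once
                seen.add(link)
                scheduled.append(link)
    return visited

order = crawl(["http://example.org/"], lambda url: TOY_WEB.get(url, []))
```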

14
Parallel Crawler
  • Advantages
  • Faster.
  • Imperative for large-scale crawling
  • Can be run on cheaper machines
  • Network load dispersion
  • Network load reduction

[Figure: Crawler1, Crawler2, ..., CrawlerN fetch from the Web into the downloaded Web pages]
Parallel Crawlers by Cho, Junghoo et al.
University of California, WWW2002, Honolulu,
Hawaii, USA
15
Evaluation Metrics
  • Overlap
  • - 1 - (number of unique pages downloaded / number of pages downloaded by the team of crawlers)
  • Coverage
  • - number of pages downloaded by the parallel crawler / total number of reachable pages
  • Communication overhead
  • - number of exchanged messages / number of page downloads
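
A minimal sketch of the three metrics as defined on this slide. The function names and the sample download lists are hypothetical; each crawler's downloads are given as a list of page identifiers.

```python
def overlap(downloads_per_crawler):
    """1 - (unique pages downloaded / pages downloaded by the whole team)."""
    total = sum(len(d) for d in downloads_per_crawler)
    unique = len(set().union(*downloads_per_crawler))
    return 1 - unique / total

def coverage(downloads_per_crawler, reachable_pages):
    """Unique reachable pages downloaded / total number of reachable pages."""
    unique = set().union(*downloads_per_crawler)
    return len(unique & set(reachable_pages)) / len(reachable_pages)

def communication_overhead(messages_exchanged, downloads_per_crawler):
    """Exchanged messages / page downloads."""
    return messages_exchanged / sum(len(d) for d in downloads_per_crawler)

# Hypothetical run: page g was downloaded by both crawlers.
team = [["a", "b", "c", "g"], ["f", "g", "h", "i"]]
reachable = list("abcdefghi")
ov = overlap(team)                        # 1 - 7/8
cov = coverage(team, reachable)           # 7/9
oh = communication_overhead(2, team)      # 2/8
```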

16
Assignment of search areas
  • Partitioning the Web
  • Address division: .net, .ca, UdeM.ca
  • Topic
  • Static assignment (see next page)
  • Dynamic assignment (see multi-agent collaborative search)

17
Partition function
  • There is a multitude of ways to partition the web
  • Site hashing
  • - based on the hash value of the site name of a URL
  • URL hashing
  • - based on the hash value of the whole URL
  • Hierarchical
  • - partition the web hierarchically, based on the URLs of the pages
  • Partitioning will come up again with agents!
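
The three partition functions can be sketched as below. The choice of SHA-256, the helper names, and the suffix table for the hierarchical scheme are assumptions made for illustration, not details from the cited paper.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(text, n_crawlers):
    """Map a string to a crawler index with a stable hash."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers

def site_hash(url, n_crawlers):
    """Site hashing: all pages of one site go to the same crawler."""
    return _bucket(urlsplit(url).hostname or "", n_crawlers)

def url_hash(url, n_crawlers):
    """URL hashing: pages of one site may be spread over several crawlers."""
    return _bucket(url, n_crawlers)

def hierarchical(url, suffix_to_crawler, default=0):
    """Hierarchical: assign by domain suffix, e.g. {".net": 0, ".ca": 1}."""
    host = urlsplit(url).hostname or ""
    for suffix, crawler in suffix_to_crawler.items():
        if host.endswith(suffix):
            return crawler
    return default
```

Site hashing keeps intra-site links inside one partition, which matters for the crawling modes discussed on the following slides.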

18
  • Crawling modes (Examples)
  • Firewall mode, Cross-over mode, Exchange mode
  • Site1 (Crawler1)
    Site2(Crawler2)
  • Parallel Crawlers by Cho, Junghoo et al.
    University of California, Los Angeles WWW2002,
    Honolulu, Hawaii, USA

[Figure: link graph over pages a to i, divided between Site1 (Crawler1) and Site2 (Crawler2)]
19
  • Firewall mode: download only within partitions
  • Crawler1: a→b, a→c
  • Crawler2: f→g, g→h, g→i
  • Site1 (Crawler1), Site2 (Crawler2)

[Figure: link graph over pages a to i, divided between Site1 and Site2]
  • d and e are overlooked!

20
  • Cross-over mode: download between partitions as well
  • Crawler1: a→b, a→c; a→g, g→h, h→d, d→e, g→i
  • Crawler2: f→g, g→h, g→i; h→d, d→e
  • Site1 (Crawler1), Site2 (Crawler2)

[Figure: link graph over pages a to i, divided between Site1 and Site2]
  • Duplication of work!

21
  • Exchange mode: download within partitions, exchange information
  • Crawler1: a→b, a→c, then g → Crawler2
  • Crawler2: f→g, g→h, g→i, then d → Crawler1
  • Site1 (Crawler1), Site2 (Crawler2)

[Figure: link graph over pages a to i, divided between Site1 and Site2]
  • Requires communication
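
The three modes can be compared with a small simulation. The link graph is reconstructed from the link lists on the slides; placing d and e in Site1, the seeds a and f, and the simplified cross-over rule (each crawler follows every link it finds) are assumptions of this sketch.

```python
from collections import deque

# Link graph reconstructed from the slides' link lists (an assumption).
LINKS = {"a": ["b", "c", "g"], "b": [], "c": [], "d": ["e"], "e": [],
         "f": ["g"], "g": ["h", "i"], "h": ["d"], "i": []}
OWNER = {**{p: 1 for p in "abcde"}, **{p: 2 for p in "fghi"}}
SEEDS = {1: ["a"], 2: ["f"]}

def crawl(mode):
    """Simulate two crawlers; mode is 'firewall', 'cross-over' or 'exchange'."""
    queues = {c: deque(urls) for c, urls in SEEDS.items()}
    downloaded = {c: [] for c in SEEDS}
    messages = 0
    while any(queues.values()):
        for c, queue in queues.items():
            if not queue:
                continue
            page = queue.popleft()
            if page in downloaded[c]:
                continue                      # this crawler already has it
            downloaded[c].append(page)
            for target in LINKS[page]:
                if OWNER[target] == c or mode == "cross-over":
                    queue.append(target)      # follow the link ourselves
                elif mode == "exchange":
                    queues[OWNER[target]].append(target)  # hand over to owner
                    messages += 1
                # firewall mode: links into the other partition are dropped
    return downloaded, messages

results = {mode: crawl(mode) for mode in ("firewall", "cross-over", "exchange")}
```

In this toy run, firewall mode misses d and e, cross-over mode downloads several pages twice, and exchange mode covers every page once at the cost of two messages.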

22
  • Minimizing communication in Exchange Mode
  • Batch communication
  • Allow replication
  • 1) Links to pages follow a Zipf distribution (... 20-80 factor)
  • 2) Replicate some popular URLs at each crawler
  • Zipf distribution

[Figure: Zipf distribution, number of incoming links per page]
23
Evaluating quality
  • We want important pages
  • Quality measure: $|\text{Pages} \cap \text{Top}_k| \,/\, |\text{Top}_k|$
  • Pages: the k downloaded pages
  • $\text{Top}_k$: the top k most important pages
  • Indication of importance: backlink count
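
A minimal sketch of this quality measure, using the backlink count as the importance score. The function names and the sample link list are hypothetical.

```python
from collections import Counter

def backlink_counts(links):
    """links: iterable of (source, target) pairs; count incoming links."""
    return Counter(target for _, target in links)

def quality(downloaded, links, k):
    """|Pages ∩ Top_k| / |Top_k|, with Top_k ranked by backlink count."""
    counts = backlink_counts(links)
    top_k = {page for page, _ in counts.most_common(k)}
    if not top_k:
        return 0.0
    return len(set(downloaded) & top_k) / len(top_k)

# Hypothetical links: g has 3 backlinks, b has 2, c has 1.
links = [("a", "g"), ("f", "g"), ("h", "g"), ("a", "b"), ("c", "b"), ("a", "c")]
q = quality(["a", "b", "c"], links, k=2)   # Top_2 = {g, b}; only b was downloaded
```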

24
  • Comparison [2]
  • From experiments [2]
  • 1) firewall mode: number of parallel crawlers < 4, lower quality
  • 2) exchange mode: small network traffic, maximizes quality
  • 3) replicating between 10,000 100,000 (sic) popular URLs reduces communication overhead by 40%

25
Intelligent crawling
  • Indiscriminate crawlers (e.g., for Google)
  • - Any new page is good
  • Topic-oriented crawlers
  • - e.g., calls for tenders
  • - We just want new pages on a topic of interest
  • → Intelligent crawler

Intelligent Crawling on the WWW with Arbitrary
Predicates, C. Aggarwal,et al., IBM TJ Watson
Res. Ctr., WWW10, Hong-Kong 2001
26
Focused Crawling
  • Which node to explore next ?
  • Depth-first ? Breadth-first ?
  • Best-first ! But what is best?
  • Focused crawling is best; how to establish focus?
  • - Linkage locality
  • - Sibling locality

[Figure: example link graphs with pages labeled topic X and topic Y, illustrating linkage locality and sibling locality]
27
Focused Crawling
  • Objective: given a specific query, find
  • -- Good sources of content (authorities): many links TO them
  • -- Good sources of links (hubs): many links FROM them
  • authorities, hubs
  • Given an arbitrary query, can we auto-focus?
  • -- learning capability
  • -- learning model

28
Learning Model
  • Analyze links from pages on the search periphery
  • Learning how to pick good links to follow
[Figure: link graph with nodes 1, 2, 3, 4 and C; legend: visited web page, page to visit, hyperlink]
29
Learning Model
  • Clues based on
  • - content
  • - URL tokens
  • - linkage info
  • - sibling structure
  • Different needs require different learning
  • - the crawler needs to learn during the crawl
  • - reuse learned information
  • The crawler should be intelligent

30
Intelligent Crawling
  • Priority list of URLs to be explored (PList)
  • User-defined predicate to compute the interest of a page (processed query)
  • KB: knowledge base

31
Intelligent Crawling
  • Algorithm Intelligent-Crawler()
  • Begin
  •   Priority-List (PList) ← starting seeds
  •   While not (termination) do
  •   begin
  •     Reorder URLs on PList using KB
  •     Drop unimportant items from PList
  •     W ← pop the first element on PList
  •     Fetch the Web page W
  •     Parse W and add all the outlinks in W to PList
  •     If W satisfies the user-defined predicate, then store W
  •     Update KB using content and link information for W
  •   end
  • End
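
The algorithm above can be sketched as runnable code. This is a simplified illustration, not the method of the cited paper: the knowledge base here learns only from URL tokens, the priority of a URL is the largest interest ratio among its known tokens, and `TOY_WEB`, `KnowledgeBase` and `intelligent_crawl` are hypothetical names.

```python
import re

def url_tokens(url):
    """Alphabetic tokens of a URL, e.g. {'http', 'portal', 'test', 'eshop'}."""
    return {t for t in re.split(r"[^a-z]+", url.lower()) if t}

class KnowledgeBase:
    def __init__(self):
        self.n1 = 0       # URLs crawled
        self.n2 = 0       # crawled URLs that satisfied the predicate
        self.seen = {}    # token -> crawled pages with the token in the URL
        self.hits = {}    # token -> those pages that satisfied the predicate

    def update(self, url, satisfied):
        self.n1 += 1
        self.n2 += int(satisfied)
        for tok in url_tokens(url):
            self.seen[tok] = self.seen.get(tok, 0) + 1
            self.hits[tok] = self.hits.get(tok, 0) + int(satisfied)

    def interest(self, url):
        if self.n2 == 0:
            return 1.0    # no evidence yet: every URL is equally interesting
        p_c = self.n2 / self.n1
        ratios = [self.hits[t] / self.seen[t] / p_c
                  for t in url_tokens(url) if t in self.seen]
        return max(ratios, default=1.0)

def intelligent_crawl(seeds, fetch, predicate, max_pages=50, max_plist=1000):
    kb, plist, visited, stored = KnowledgeBase(), list(seeds), set(), []
    while plist and len(visited) < max_pages:
        plist.sort(key=kb.interest, reverse=True)  # reorder PList using KB
        del plist[max_plist:]                      # drop unimportant items
        url = plist.pop(0)                         # pop the first element
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch(url)                # fetch and parse W
        plist.extend(u for u in outlinks if u not in visited)
        satisfied = bool(predicate(text))
        if satisfied:
            stored.append(url)                     # store W
        kb.update(url, satisfied)                  # update KB
    return stored

# Hypothetical toy web: URL -> (page text, outlinks).
TOY_WEB = {
    "http://portal.test/": ("portal", ["http://portal.test/eshop/1",
                                       "http://portal.test/news/1"]),
    "http://portal.test/eshop/1": ("online mall", ["http://portal.test/eshop/2",
                                                   "http://portal.test/news/2"]),
    "http://portal.test/eshop/2": ("online mall", []),
    "http://portal.test/news/1": ("weather", []),
    "http://portal.test/news/2": ("sports", []),
}
stored = intelligent_crawl(["http://portal.test/"], TOY_WEB.__getitem__,
                           lambda text: "online mall" in text, max_pages=3)
```

With a budget of three pages, the crawler learns after the first hit that the URL token eshop is promising and fetches the second eshop page before any news page.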

32
Intelligent Crawler
  • During the crawling process, we can accumulate some information, such as
  • - number of URLs crawled, N1
  • - number of URLs crawled which satisfy the predicate, N2
  • - number of pages in which word i occurs and which satisfy the predicate, N3
  • - number of pages with a keyword in the URL which satisfy (or do not satisfy) the predicate
  • How to create a KB?
  • A later example will illustrate URL-based learning

33
Intelligent Crawler
Example: the user is interested in online malls,
BUT only 0.1% of web pages contain online malls.
HOWEVER, if the word "eshop" is in the URL, then the probability of the page containing online malls is 5%.
Thus we should add to the KB the fact that "eshop" in the URL is a useful criterion in choosing pages to explore.
34
Formal view
C: the event that a crawled web page satisfies the given predicate
P(C): probability of event C, $P(C) = N2 / N1$
E: a fact that we know about a candidate URL
Knowledge of the event E may increase the probability P(C), thus
$P(C \mid E) = P(C \cap E) / P(E)$
$P(C \mid E) / P(C) = P(C \cap E) / (P(C)\,P(E))$
Calculate the interest ratio for the event C given event E as IR(C,E):
$IR(C,E) = P(C \mid E) / P(C) = P(C \cap E) / (P(C)\,P(E))$
The values of $P(C \cap E)$ and $P(E)$ can be calculated during the crawl
from Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal, et al.
35
Mall example
  • Example
  • 0.1% of web pages contain online malls, i.e. satisfy the predicate ($P(C) = 0.1\%$)
  • if the word "eshop" occurs in the URL (E), then the probability of satisfying ($P(C \mid E)$) increases to 5%
  • So the interest ratio is 5% / 0.1% = 50
  • $IR(C,E) = P(C \mid E) / P(C)$
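
The interest ratio can be computed directly from counts gathered during the crawl. The function name and the counts below are hypothetical; the counts are chosen so that they reproduce the 0.1% and 5% of the mall example.

```python
def interest_ratio(n_crawled, n_satisfy, n_with_fact, n_satisfy_with_fact):
    """IR(C,E) = P(C and E) / (P(C) * P(E)), estimated from crawl counts."""
    p_c = n_satisfy / n_crawled                  # P(C)
    p_e = n_with_fact / n_crawled                # P(E)
    p_c_and_e = n_satisfy_with_fact / n_crawled  # P(C and E)
    return p_c_and_e / (p_c * p_e)

# 100,000 pages crawled, 100 about online malls (0.1%);
# 2,000 URLs contain "eshop", of which 100 are about online malls (5%).
ratio = interest_ratio(100_000, 100, 2_000, 100)
```

A ratio above 1 means the fact E makes a candidate URL more promising than an average URL; a ratio below 1 means it makes the URL less promising.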

36
Collaborative Search
  • 3 ways to search for information: browsing, querying and filtering
  • Collaborative types [10]
  • Collaborative browsing
  • Mediated searching
  • Collaborative information filtering
  • Collaborative agents
  • Collaborative reuse of results

37
Collaborative Search
  • What do we mean by collaboration ?
  • Human ↔ computer ↔ Human
  • Human ↔ Computer
  • Computer agent ↔ Computer agent

38
Collaborative Search
  • Man - machine
  • Collaborative browsing --- Ariadne system [23]
  • Collaborative reuse of results --- Pluribus [21] (2000)
  • Collaborative information filtering --- Collaborative filtering [25]
  • Mediated searching --- DIAMS [22] (2000)
  • Machine - machine (→ Collaborative agents)
  • Meta-search engines: MetaCrawler, Mamma, Metagopher, Copernic
  • Topic-oriented collaborative crawler [11] (2002)
  • Collaborative spider [16] (2002)
  • UbiCrawler [5] (2003)
  • Collaborator [19] (under development)

39
Existing systems
  • Meta-search engines
  • MetaCrawler, Mamma, Metagopher, Copernic
  • query → passed to other search engines
  • results ← collected from other search engines
  • combined results → user

40
Topic-oriented collaborative crawlers [11] (2002)
  • Each crawler is given a specific topic
  • It knows the topics of its colleagues
  • It sends URLs of pages it doesn't care about to the one responsible for the topic
  • Problems
  • - static, predefined topic categories
  • - static assignment (partition function)
  • - a controller assigns sites to each crawler

41
Collaborative spiders [16] (2002)
  • JATLite (Java Agent Template Lite)
  • uses KQML
  • User agents, ONE scheduler agent
  • Collaborator agent (as a mediator)
  • search, content mining
  • post-retrieval analysis system
  • information sharing within a user group

42
UbiCrawler [5] (2003)
  • consistent hashing as partition function: buckets are agents, keys are hosts
  • failure detector: the only synchronous component
  • each agent keeps track of the visited URLs in a hash table
  • pure Java application, RMI based, multi-threaded agents
43
Collaborator [19] (under development)
  • a shared workspace framework for virtual teams
  • 3 tier architecture, J2EEAgent ( BlueJADE ),
  • client tier, middle tier, enterprise information
    systems tier
  • personal agents, session management agents
  • desktop or wireless device
  • Jade, FIPA

44
Conclusion
  • Current collaborative search
  • - collaborative
  • - dynamic
  • - adaptive exploring
  • - intelligent
  • - decentralized
  • Trend → Agent

45
Multi-agent collaborative search
  • Challenges?

[Figure: a Query reaches agent_1, agent_2, ..., agent_n; each agent has its own DataStore and explores the Web]
46
Challenges
  • Partition dynamic? - dynamically assigning the web domain to agents
  • Load balancing? - each cache stores roughly the same number of pages
  • Content look up? - an agent can easily locate the storage that holds particular content
  • Solution: Web cache, consistent hashing
47
Web Caching
  • Content (URL → content)
  • - for download efficiency
  • Indexing information (keyword → URL)
  • - for search efficiency

48
Browser caching
1. For efficiency
2. Each client has its own cache
[Figure: clients, each with its own cache, in front of www.abc.com]
49
Proxy caches
1. Each cache stores a subset of all pages
2. Each client knows several caches
[Figure: clients, domain caches, and www.abc.com]
50
Agents web cache

[Figure: Users talk to agents; the agents communicate with each other, each has its own Web cache, and all access the Web]
51
Content Look Up
  • Summary cache
  • Distributed hash
  • Consistent hash
  • Also achieves load balancing
  • Partition dynamic

52
  • Summary cache
  • Each cache knows the content of all the others
  • C1 holds A, B, C and knows C2: D, E; C3: F, G
  • C2 holds D, E and knows C1: A, B, C; C3: F, G
  • C3 holds F, G and knows C1: A, B, C; C2: D, E
  • A client asking for F is directed to C3
53
Distributed hashing
  • Distribute the work amongst many agents
  • Efficient, O(1), determination of agent
    responsible for a given KW or URL
  • Problem: redistribution of data when the number of agents changes
  • Solution: consistent hashing?
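
The redistribution problem is easy to see with plain modulo hashing. The sketch below assumes SHA-256 as the hash and hypothetical site names; it counts how many keys change agent when a third agent joins.

```python
import hashlib

def agent_for(key, n_agents):
    """O(1) lookup: hash the key and take it modulo the number of agents."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_agents

keys = [f"site{i}.example" for i in range(1000)]
# Keys whose responsible agent changes when going from 2 to 3 agents.
moved = sum(agent_for(k, 2) != agent_for(k, 3) for k in keys)
```

With a uniform hash, about two thirds of the keys are expected to move here, although only one agent was added.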

54
Consistent Hashing
  • Use a standard hash function H to map items 1,2,3,4,5 and agents A,B to a unit circle
  • Map each item to the closest cache
  • - A holds 1,2,3
  • - B holds 4,5
  • [4] Web Caching with Consistent Hashing by David Karger et al., MIT Lab

55
Consistent Hashing
  • To add a new agent C, hash the agent id
  • Move the items close to it
  • Other items don't move
  • - A holds 3
  • - B holds 4,5
  • - C holds 1,2
  • this example will be reused in partition dynamic

[Figure: unit circle with the new agent C added]
56
Consistent Hashing
  • Design features
  • - Load balancing: each bucket stores roughly the same number of pages
  • - Content look up: easily locate a given key with hash function H
  • - Smoothness: little impact on hash bucket contents when buckets are added/removed
  • Applications of consistent hashing
  • - Freenet [6], UbiCrawler [5]
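
A minimal consistent-hashing sketch under stated assumptions: agents and keys are hashed with SHA-256 onto the unit circle, and each key goes to the first agent at or after its position (the common clockwise-successor variant, rather than the nearest point in either direction). `ConsistentHashRing` and the site names are hypothetical.

```python
import bisect
import hashlib

def ring_position(name):
    """Map a string to a point in [0, 1) on the unit circle."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    def __init__(self, agents=()):
        self._points = []                 # sorted list of (position, agent)
        for agent in agents:
            self.add_agent(agent)

    def add_agent(self, agent):
        bisect.insort(self._points, (ring_position(agent), agent))

    def agent_for(self, key):
        # First agent clockwise from the key, wrapping around the circle.
        idx = bisect.bisect_left(self._points, (ring_position(key), ""))
        return self._points[idx % len(self._points)][1]

keys = [f"site{i}.example" for i in range(1000)]
ring = ConsistentHashRing(["A", "B"])
before = {k: ring.agent_for(k) for k in keys}
ring.add_agent("C")
after = {k: ring.agent_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to C; keys that stay with A or B are untouched, which is the smoothness property listed above. With a single point per agent the load can be uneven; real systems usually hash each agent to several points.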

57
Partition dynamic
  • Suppose that, in the example above, items 1,2,3,4,5 are the site names of scheduled URLs. At first only agents A and B explore the web, partitioned like this:

[Figure: the Web divided into sites 1 to 5]
Agent_A: 1,2,3
Agent_B: 4,5
58
Partition dynamic
  • After adding a new agent C, reassign the web domain to agents like this:

[Figure: the Web divided into sites 1 to 5]
Agent_A: 3
Agent_B: 4,5
Agent_C: 1,2
59
Concrete model
  • Multiagent layer
  • - a general agent paradigm is not practical
  • Agent types
  • - Interface agent
  • - Collector agent
  • - Information agent
  • Agent functionality
  • - interface agent: interactively collects query information from the user
  • - collector agent: collects information, forms the plan, composes the results
  • - information agent: does focused crawling according to the plan, forms indexed files

60
Concrete model

[Figure: layered architecture with collaborative links and a query/answer flow: User1 ... User n, InterfaceAgent1 ... InterfaceAgent k, Collector Agent1 ... Collector Agent j, infoAgent1 ... infoAgent m, Database1 ... Database m]
61
Concrete model
[Figure: infoAgent_1 ... infoAgent_n communicate with each other; each contains a Crawler (fed by the Web), Document processor, Indexing, Local Storage and KB]
62
References
  • [1] How a Search Engine Works, by Elizabeth Liddy, School of Information Studies, Syracuse University. http://www.infotoday.com/searcher/may01/liddy.htm
  • [2] Parallel Crawlers, by Cho, Junghoo; Garcia-Molina, Hector. http://dbpubs.stanford.edu:8090/pub/2002-9
  • [3] Mercator: A Scalable, Extensible Web Crawler. http://research.compaq.com/SRC/mercator/papers/www/paper.html
  • [4] David Karger, Tom Leighton, Danny Lewin, and Alex Sherman. Web caching with consistent hashing. In Proc. of the 8th International World Wide Web Conference, Toronto, Canada, 1999
  • [5] UbiCrawler: A Scalable Fully Distributed Web Crawler (2003). http://ausweb.scu.edu.au/aw02/papers/refereed/vigna/paper.html
  • [6] Freenet: A Distributed Anonymous Information Storage and Retrieval System. http://citeseer.nj.nec.com/clarke00freenet.html

63
  • [7] Looking Up Data in P2P Systems, by Hari Balakrishnan et al. http://www.utsc.utoronto.ca/rosselet/cscd58s/tut03/pres03/p2p-lookups.pdf
  • [8] Web Caching, by Ion Stoica. www.cs.berkeley.edu/istoica/cs268/notes/lecture21.pdf
  • [9] The Effects of Cooperation on Multiagent Search in Task-Oriented Domains. http://citeseer.nj.nec.com/557884.html
  • [10] Collaborative Search and Retrieval: Finding Information Together. https://doc.telin.nl/dscgi/ds.py/Get/File-8269/GigaCE-Collaborative_Search_and_Retrieval__Finding_Information_Together.pdf
  • [11] Topic-Oriented Collaborative Crawling, by Chiasen Chung, Charles L.A. Clarke. http://citeseer.nj.nec.com/538331.html
  • [12] Intelligent Crawling on the World Wide Web with Arbitrary Predicates (2001). http://citeseer.nj.nec.com/aggarwal01intelligent.html
  • [13] Scaling Question Answering to the Web, by Cody Kwok et al. http://www10.org/cdrom/papers/120/

64
  • [14] The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998). http://citeseer.nj.nec.com/brin98anatomy.html
  • [15] Design and evaluation of a multi-agent collaborative Web Mining System (2003). http://citeseer.nj.nec.com/chau03design.html
  • [16] Text-learning and related intelligent agents (1999), by Dunja Mladenic. http://citeseer.nj.nec.com/mladenic99textlearning.html
  • [17] Text learning and related intelligent agents, by Dunja Mladenic. http://www.cs.cmu.edu/TextLeauning/pww/
  • [18] Coordination of Multiple Intelligent Software agents, by Sycara, K., and Zeng, D. http://www.cs.cmu.edu/softagents/publications.html
  • [19] Enhancing Collaborative Work through Agents, by F. Bergenti et al. http://www-dii.ing.unisi.it/aiia2002/paper/AGENTI/bergenti-aiia02.pdf
  • [20] Agents that Reduce Work and Information Overload, by Pattie Maes. http://www.cs.brandeis.edu/cs125a/content/agentsmaes.doc
  • [21] Collaboratively Searching the Web: An Initial, by Agustin Schapira. http://none.cs.umass.edu/schapira/thesis/report/

65
  • [22] Collaborative Information Agents on the World Wide Web, by James R. Chen. http://ic.arc.nasa.gov/ic/projects/aim/papers/dl98.pdf
  • [23] Collaborative browsing and visualisation of the search process. http://www.comp.lancs.ac.uk/computing/research/cseg/projects/ariadne/docs/elvira96.html
  • [24] Collaborative design that used the shared cognitive space. www.jaist.ac.jp/library/thesis/is-master-2002/paper/t-kizaki/abstract.ps
  • [25] Collaborative Filtering, by Berkeley Workshop. http://www.sims.berkeley.edu/resources/collab/

66
Thanks !