Title: Collaborative Search
1 Collaborative Search
2
- Traditional IR
- Web search
- Crawlers
- - parallel crawler
- - intelligent crawler
- Collaborative Search
- References
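3 Traditional IR
[Figure: the traditional IR model. System side: acquisition (documents, objects), representation (indexing, ...), database of indexed documents. User side: problem (information need), representation (question), query (search formulation). Both sides meet at matching (searching), which returns the retrieved objects, with a feedback loop.]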
4 Classic Information Retrieval
- Homogeneous documents
- Well categorized
- Small well-controlled collection
- Closed, static environment
- Controlled collection growth
5 Web Search
- Web
- - open, dynamic environment
- - vast uncontrolled collection of PAGES
- Web page
- - heterogeneous: various formats, languages
- - content may change over time!
- Importance of LINKS
- Existing Search Facilities
- - Generic: Yahoo, AskJeeves, Google, etc.
- - Specialized: Pluribus, Collaborative Spider
6 Common operations
- Indexing
- - identifies potential index terms in documents
- Query processing
- - form keywords
- Search
- - access indexed file
- Ranking
7 Ranking
- Ranking is important
- Factors which influence rank
- - Term location or frequency
- - Proximity to query terms
- - Date of publication
- - Length
- - Popularity
- - Heuristics: proper nouns may have higher weights
- - WWW: link analysis, popularity (e.g. Google)
8 The Web indexing
- Web pages are heterogeneous documents
- Contain both text information and meta information
- External meta information can be inferred
- Must be processed before the pertinence can be established
9 Indexing WWW documents
- Web pages require preprocessing to get a uniform data structure
- - Normalizes the document stream to a predefined format
- - Breaks the document stream into desired retrievable units
- - Isolates and metatags subdocument pieces
[Figure: Web page 1, Web page 2, ..., Web page n pass through preprocessing into a uniform format.]
10 Computing weights
- Assign a weight to each descriptor of a document and add it to the index
- Weights are based on
- - term frequency within the document (tf)
- - global term frequency within the corpus
- The global frequency will be a problem when using parallel independent agents to do indexing
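A minimal sketch of how the two ingredients can be combined into one weight. The slide does not give a formula, so the sketch assumes the common tf-idf combination, in which the corpus-wide statistic is the number of documents containing the term. The function name `compute_weights` and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def compute_weights(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df).

    tf-idf is an assumed weighting scheme, not one stated on the slide.
    """
    # Term frequency within each document (tf).
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Corpus-wide statistic: in how many documents each term occurs (df).
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    n_docs = len(docs)
    return {
        doc_id: {term: count * math.log(n_docs / df[term])
                 for term, count in counts.items()}
        for doc_id, counts in tf.items()
    }

# Hypothetical two-document corpus.
docs = {"d1": "web crawler web search", "d2": "web cache"}
weights = compute_weights(docs)
```

The `df` counter needs the whole corpus, which is why independent indexing agents that each see only part of the collection cannot compute it locally.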
11 IR on Web
[Figure: architecture of IR on the Web. Components shown: Query, Query Processor, Search/match, Indexed files, Page ranking, Responses, Browse, Document Processor, Crawlers, Web pages, Web.]
12 Web Document discovery
- Corpus is very large
- Dynamic
- Open
- Documents must be discovered
- - use a Web crawler
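13 Web Crawler
- What is a Crawler?
[Figure: the crawler loop. init loads the initial URLs; get next url takes a URL from the scheduled URLs; get page fetches it from the Web, adds it to the visited URLs and stores the web page; extract urls feeds new URLs back into the scheduled URLs.]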
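The loop in the figure can be sketched as follows. This is an illustration only: the toy link table `TOY_WEB` replaces real HTTP fetching, and the names `crawl` and `fetch_links` are hypothetical.

```python
from collections import deque

# Hypothetical toy web: URL -> URLs linked from that page.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

def crawl(initial_urls, fetch_links):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    scheduled = deque(initial_urls)    # init: initial URLs become scheduled URLs
    seen = set(initial_urls)           # URLs already scheduled or visited
    visited = []
    while scheduled:
        url = scheduled.popleft()      # get next url
        links = fetch_links(url)       # get page, then extract urls
        visited.append(url)
        for link in links:
            if link not in seen:       # schedule each URL only once
                seen.add(link)
                scheduled.append(link)
    return visited

order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

A queue gives breadth-first order; replacing it with a priority queue is what the later focused and intelligent crawling slides build on.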
14 Parallel Crawler
- Advantages
- - Faster
- - Imperative for large-scale crawling
- - Can be run on cheaper machines
- - Network load dispersion
- - Network load reduction
[Figure: Crawler1, Crawler2, ..., CrawlerN crawl the Web and produce the downloaded Web pages.]
Parallel Crawlers, by Junghoo Cho et al., University of California, WWW2002, Honolulu, Hawaii, USA
15 Evaluation Metrics
- Overlap
- - 1 - (# of unique pages downloaded / # of pages downloaded by the team of crawlers)
- Coverage
- - # of pages downloaded by the parallel crawler / total # of reachable pages
- Communication overhead
- - # of exchanged messages / # of page downloads
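The three ratios can be computed directly from per-crawler download lists. The sketch below assumes that coverage counts each page once even if several crawlers downloaded it; the function names and the sample data are hypothetical.

```python
def overlap(downloads_per_crawler):
    """1 - (# unique pages / # pages downloaded by the whole team)."""
    all_downloads = [p for pages in downloads_per_crawler for p in pages]
    return 1 - len(set(all_downloads)) / len(all_downloads)

def coverage(downloads_per_crawler, reachable_pages):
    """# distinct reachable pages downloaded / total # of reachable pages."""
    unique = {p for pages in downloads_per_crawler for p in pages}
    return len(unique & set(reachable_pages)) / len(reachable_pages)

def communication_overhead(n_messages, downloads_per_crawler):
    """# exchanged messages / # page downloads."""
    return n_messages / sum(len(pages) for pages in downloads_per_crawler)

# Hypothetical run: two crawlers, page "g" downloaded twice, "d" and "e" missed.
downloads = [["a", "b", "c", "g"], ["f", "g", "h", "i"]]
reachable = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
```

Here 8 downloads yield 7 unique pages out of 9 reachable ones, so overlap is 1/8 and coverage is 7/9.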
16 Assignment of search areas
- Partitioning the Web
- - Address division: .net, .ca, UdeM.ca
- - Topic
- Static assignment (see next page)
- Dynamic assignment (see multi-agent collaborative search)
17 Partition function
- Multitude of ways to partition the web
- Site hashing
- - Based on the hash value of the site name of a URL
- URL hashing
- - Based on the hash value of the whole URL
- Hierarchical
- - Partition the web hierarchically based on the URLs of the pages
- Partitioning will come up again with agents!
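A sketch of the three partition functions, under stated assumptions: md5 is an arbitrary choice of stable hash, and the hierarchical variant is approximated by matching host-name suffixes such as `.ca`. All function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def stable_hash(text):
    """Hash that is identical across runs and machines (md5 is an assumption)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

def site_hash_partition(url, n_crawlers):
    """Site hashing: all pages of one site go to the same crawler."""
    return stable_hash(urlsplit(url).netloc) % n_crawlers

def url_hash_partition(url, n_crawlers):
    """URL hashing: pages of one site may be spread over several crawlers."""
    return stable_hash(url) % n_crawlers

def hierarchical_partition(url, suffix_to_crawler, default=0):
    """Hierarchical: assign by host-name suffix (one possible reading)."""
    host = urlsplit(url).netloc
    for suffix, crawler in suffix_to_crawler.items():
        if host.endswith(suffix):
            return crawler
    return default
```

Site hashing keeps intra-site links inside one partition, which matters for the crawling modes that follow, because only links that cross partitions need special handling.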
18 Crawling modes (Examples)
- Firewall mode, Cross-over mode, Exchange mode
[Figure: example link graph over pages a to i, partitioned into Site1 (Crawler1) and Site2 (Crawler2).]
Parallel Crawlers, by Junghoo Cho et al., University of California, Los Angeles, WWW2002, Honolulu, Hawaii, USA
19 Firewall mode: download within partitions
- Crawler1: a→b, a→c
- Crawler2: f→g, g→h, g→i
[Figure: the same link graph, Site1 (Crawler1) and Site2 (Crawler2), pages a to i.]
20 Cross-over mode: download between partitions
- Crawler1: a→b, a→c, a→g, g→h, h→d, d→e, g→i
- Crawler2: f→g, g→h, g→i, h→d, d→e
[Figure: the same link graph, Site1 (Crawler1) and Site2 (Crawler2), pages a to i.]
21 Exchange mode: download within partitions, exchange info
- Crawler1: a→b, a→c, then g → Crawler2
- Crawler2: f→g, g→h, g→i, then d → Crawler1
[Figure: the same link graph, Site1 (Crawler1) and Site2 (Crawler2), pages a to i.]
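The three modes differ only in what a crawler does with a link that leaves its partition. The sketch below replays the example; the link graph and the page-to-site assignment are reconstructed from the crawl traces on the slides and are therefore assumptions, as is the function name `crawl`. It models one round only: in exchange mode the receiving crawler would go on to crawl the URLs handed to it.

```python
from collections import deque

# Assumed link graph, reconstructed from the traces (a→b, a→c, a→g, ...).
LINKS = {"a": ["b", "c", "g"], "f": ["g"], "g": ["h", "i"], "h": ["d"], "d": ["e"]}
# Assumed partition: Site1 (Crawler1) owns a to e, Site2 (Crawler2) owns f to i.
OWNER = {page: (1 if page in "abcde" else 2) for page in "abcdefghi"}

def crawl(crawler_id, seeds, mode):
    """Return (pages downloaded, URLs handed over to the other crawler)."""
    queue, seen = deque(seeds), set(seeds)
    downloaded, handed_over = [], []
    while queue:
        page = queue.popleft()
        downloaded.append(page)
        for link in LINKS.get(page, []):
            if link in seen:
                continue
            seen.add(link)
            if OWNER[link] == crawler_id or mode == "cross-over":
                queue.append(link)         # follow the link ourselves
            elif mode == "exchange":
                handed_over.append(link)   # send the URL to its owner
            # firewall mode: links leaving the partition are dropped
    return downloaded, handed_over
```

In firewall mode pages d and e are never downloaded, since they are reachable only through Site2; cross-over mode reaches them at the price of both crawlers downloading g, h, i, d and e.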
22 Minimizing communication in Exchange Mode
- Batch communication
- Allow replication
- - 1) Because links to pages follow a Zipf distribution (... 20-80 factor)
- - 2) Replicate some popular URLs at each crawler
[Figure: Zipf distribution, number of incoming links per page.]
23 Evaluating quality
- We want important pages
- Quality measure: $|\mathrm{Pages} \cap \mathrm{Top}_k| \,/\, |\mathrm{Top}_k|$
- - Pages: the k pages downloaded
- - Top_k: the top k most important pages
- Indication of importance: backlink count
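A small sketch of the quality measure with backlink count as the importance score. The helper names and the toy link table are hypothetical, and ties in backlink count are broken alphabetically, which is an arbitrary choice.

```python
def top_k_by_backlinks(links, k):
    """Return the k pages with the most incoming links."""
    backlinks = {}
    for source, targets in links.items():
        backlinks.setdefault(source, 0)
        for target in targets:
            backlinks[target] = backlinks.get(target, 0) + 1
    # Most backlinks first; alphabetical order breaks ties (arbitrary choice).
    ranked = sorted(backlinks, key=lambda page: (-backlinks[page], page))
    return set(ranked[:k])

def quality(downloaded, top_k):
    """|Pages ∩ Top_k| / |Top_k|."""
    return len(set(downloaded) & top_k) / len(top_k)

# Hypothetical link table: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "d": ["c", "b"]}
top2 = top_k_by_backlinks(links, 2)
```

With these links, c has three backlinks and b has two, so a crawler that downloaded a and c scores 0.5.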
24 Comparison [2]
- From experiments [2]
- - 1) firewall mode: number of parallel crawlers < 4, less quality
- - 2) exchange mode: small network traffic, maximizes quality
- - 3) replicating between 10,000 and 100,000 (sic) popular URLs reduces communication overhead by 40%
25 Intelligent crawling
- Indiscriminate crawlers (e.g. for Google)
- - Any new page is good
- Topic-oriented crawlers
- - E.g. call for tenders
- - We just want new pages on a topic of interest
- → Intelligent crawler
Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al., IBM T.J. Watson Research Center, WWW10, Hong Kong, 2001
26 Focused Crawling
- Which node to explore next?
- Depth-first? Breadth-first?
- Best-first! But what is best?
- Focused crawling is best; how to establish focus?
- - Linkage locality
- - Sibling locality
[Figure: two link diagrams of pages labelled topic X and topic Y, illustrating linkage locality and sibling locality.]
27 Focused Crawling
- Objective: given a specific query, find
- - Good sources of content (authorities) ... many links TO
- - Good sources of links (hubs) ... many links FROM
- - authorities, hubs
- Given an arbitrary query, can we auto-focus?
- - learning capability
- - learning model
28 Learning Model
- Analyze links from pages on the search periphery
- Learning how to pick good links to follow
[Figure: visited web pages 1 to 4 connected by hyperlinks to a candidate page C that is still to be visited.]
29 Learning Model
- Clues based on
- - content
- - URL tokens
- - linkage info
- - sibling structure
- Different needs require different learning
- - the crawler needs to learn during the crawl
- - reuse learning information
- The crawler should be intelligent
30 Intelligent Crawling
- Priority list of URLs to be explored (PList)
- User-defined predicate to compute the interest of a page (processed query)
- KB: knowledge base
31 Intelligent Crawling
Algorithm Intelligent-Crawler()
Begin
  Priority-List (PList) = {Starting Seeds}
  While not (termination) do
  begin
    Reorder URLs on PList using KB
    Drop unimportant items from PList
    W <- pop the first element on PList
    Fetch the Web page W
    Parse W and add all the outlinks in W to PList
    If W satisfies the user-defined predicate, then store W
    Update KB using content and link information for W
  end
End
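The algorithm can be sketched in Python as below. The KB here is deliberately simple and is an assumption of this sketch: it counts, per leading URL token, how many fetched pages were seen and how many satisfied the predicate. The toy web and all function names are hypothetical.

```python
# Hypothetical toy web: URL -> (page content, outlinks).
TOY_WEB = {
    "seed": ("portal", ["eshop/1", "news/1"]),
    "eshop/1": ("online mall", ["eshop/2", "news/2"]),
    "news/1": ("weather", []),
    "eshop/2": ("online mall", []),
    "news/2": ("sports", []),
}

def fetch(url):
    return TOY_WEB[url]

def predicate(content):
    return "mall" in content          # user-defined predicate

def update_kb(kb, url, content, satisfied):
    token = url.split("/")[0]         # learn from the leading URL token only
    seen_count, hit_count = kb.get(token, (0, 0))
    kb[token] = (seen_count + 1, hit_count + int(satisfied))

def score(url, kb):
    seen_count, hit_count = kb.get(url.split("/")[0], (0, 0))
    return hit_count / seen_count if seen_count else 0.0

def intelligent_crawl(seeds, max_pages, max_plist=1000):
    plist, seen, kb = list(seeds), set(seeds), {}
    stored, order = [], []
    while plist and len(order) < max_pages:                 # termination
        plist.sort(key=lambda u: score(u, kb), reverse=True)  # reorder using KB
        del plist[max_plist:]                               # drop unimportant items
        url = plist.pop(0)                                  # pop the first element
        content, outlinks = fetch(url)                      # fetch the page W
        order.append(url)
        for link in outlinks:                               # add outlinks to PList
            if link not in seen:
                seen.add(link)
                plist.append(link)
        satisfied = predicate(content)
        if satisfied:
            stored.append(url)                              # store W
        update_kb(kb, url, content, satisfied)              # update KB
    return stored, order

stored, order = intelligent_crawl(["seed"], max_pages=3)
```

After the first "eshop" page satisfies the predicate, the KB promotes the remaining "eshop" URL ahead of the "news" URLs, so both matching pages are found within three fetches.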
32 Intelligent Crawler
- During the crawling process, we can accumulate some information
- Like
- - number of URLs crawled, N1
- - number of URLs crawled which satisfy the predicate, N2
- - number of pages in which word i occurs which satisfy the predicate, N3
- - number of pages with the keyword in the URL which satisfy (or not) the predicate
- How to create a KB?
- - A later example will illustrate URL-based learning
33 Intelligent Crawler
- Example: the user is interested in online malls
- BUT only 0.1% of web pages contain online malls
- HOWEVER, if the word "eshop" is in the URL, then the probability of the page containing online malls is 5%
- Thus we should add to the KB the fact that "eshop" in the URL is a useful criterion in choosing pages to explore
34 Formal view
- C: the event that a crawled web page satisfies the given predicate
- P(C): probability of event C, $P(C) = N_2 / N_1$
- E: a fact that we know about a candidate URL
- Knowledge of the event E may increase the probability P(C), thus
- - $P(C \mid E) = P(C \cap E) / P(E)$
- - $P(C \mid E) / P(C) = P(C \cap E) / (P(C)\,P(E))$
- Calculate the interest ratio for the event C given event E as IR(C,E)
- - $\mathrm{IR}(C,E) = P(C \mid E) / P(C) = P(C \cap E) / (P(C)\,P(E))$
- The values of $P(C \cap E)$ and $P(E)$ can be calculated during the crawling
from Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al.
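35 Mall example
- Example
- - 0.1% of web pages contain online malls, i.e. satisfy the predicate: $P(C) = 0.1\%$
- - if the word "eshop" occurs in the URL (E), then the probability of satisfying the predicate increases to $P(C \mid E) = 5\%$
- - So the interest ratio is 5 / 0.1 = 50
- - $\mathrm{IR}(C,E) = P(C \mid E) / P(C)$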
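The interest ratio can be computed from counters kept during the crawl. The counts below are hypothetical, chosen so that they reproduce the 0.1% and 5% figures of the mall example; the function name is also an assumption.

```python
def interest_ratio(n_crawled, n_satisfy, n_with_e, n_satisfy_with_e):
    """IR(C,E) = P(C ∩ E) / (P(C) * P(E)), estimated from crawl counters."""
    p_c = n_satisfy / n_crawled                # P(C) = N2 / N1
    p_e = n_with_e / n_crawled                 # P(E)
    p_c_and_e = n_satisfy_with_e / n_crawled   # P(C ∩ E)
    return p_c_and_e / (p_c * p_e)

# Hypothetical counters: 100,000 pages crawled, 100 about online malls (0.1%),
# 1,000 with "eshop" in the URL, 50 of those about online malls (5%).
ratio = interest_ratio(100_000, 100, 1_000, 50)
```

A ratio above 1 means the fact E makes a candidate URL more promising than average; here it comes out at about 50, matching 5 / 0.1.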
36 Collaborative Search
- 3 ways to search for information
- - Browsing, querying and filtering
- Collaborative types [10]
- - Collaborative browsing
- - Mediated searching
- - Collaborative information filtering
- - Collaborative agents
- - Collaborative reuse of results
37 Collaborative Search
- What do we mean by collaboration?
- - Human ↔ Computer ↔ Human
- - Human ↔ Computer
- - Computer agent ↔ Computer agent
38 Collaborative Search
- Man - machine
- - Collaborative browsing --- Ariadne system [23]
- - Collaborative reuse of results --- Pluribus [21] (2000)
- - Collaborative information filtering --- Collaborative filtering [25]
- - Mediated searching --- DIAMS [22] (2000)
- Machine - machine (→ Collaborative agents)
- - meta-search engines: Meta Crawler, Mamma, Metagopher, Copernic
- - topic-oriented collaborative crawler [11] (2002)
- - Collaborative spider [16] (2002)
- - UbiCrawler [5] (2003)
- - Collaborator [19] (under development)
39 Existing systems
- meta-search engines
- - Meta Crawler, Mamma, Metagopher, Copernic
- - pass the query to other search engines
- - collect the results from the other search engines
- - combine the results for the user
40 Topic-oriented collaborative crawlers [11] (2002)
- Each crawler is given a specific topic
- It knows the topics of its colleagues
- It sends URLs of pages it doesn't care about to the one responsible for the topic
- Problems
- - static predefined topic categories
- - static assignment partition function
- - controller assigns sites to each crawler
41 Collaborative spiders [16] (2002)
- JATLite (Java Agent Template Lite)
- uses KQML
- User agents, ONE scheduler agent
- Collaborator agent (as a mediator)
- search, content mining
- post-retrieval analysis system
- information sharing among a group of users
42 UbiCrawler [5] (2003)
- consistent hashing partition function: buckets are agents, keys are hosts
- failure detector --- the only synchronous component
- each agent keeps track of the visited URLs in a hash table
- pure Java application, RMI based, multi-threaded agents
43 Collaborator [19] (under development)
- a shared workspace framework for virtual teams
- 3-tier architecture, J2EE, Agent (BlueJADE)
- - client tier, middle tier, enterprise information systems tier
- personal agents, session management agents
- desktop or wireless device
- Jade, FIPA
44 Conclusion
- Current collaborative search
- - collaborative
- - dynamic
- - adaptive exploring
- - intelligent
- - decentralized
- Trend → Agent
45 Multi-agent collaborative search
- Challenges?
[Figure: a query is served by agent_1, agent_2, ..., agent_n; each agent has its own DataStore and all of them explore the Web.]
46 Challenges
- Partition dynamic?
- - dynamically assigning the web domain to agents
- Load balancing?
- - each cache stores roughly the same # of pages
- Content look up?
- - an agent can easily locate the storage that stores particular content
- Solution: Web cache, consistent hashing
47 Web Caching
- Content (URL -> content)
- - For download efficiency
- Indexing information (Keyword -> URL)
- - Search efficiency
48 Browser caching
- 1. For efficiency
- 2. Each client has its own cache
[Figure: www.abc.com, caches, clients.]
49 Proxy caches
- 1. Each cache stores a subset of all pages
- 2. Each client knows several caches
[Figure: www.abc.com, domain caches, clients.]
50 Agents web cache
[Figure: users talk to agents; the agents communicate with each other, each agent has its own Web cache, and all of them access the Web.]
51 Content Look Up
- Summary cache
- Distributed hash
- Consistent hash
- Also achieves load balancing
- Partition dynamic
52 Summary cache
- Each cache knows the content of all the others
- - C1 holds A, B, C and knows C2: D, E; C3: F, G
- - C2 holds D, E and knows C1: A, B, C; C3: F, G
- - C3 holds F, G and knows C1: A, B, C; C2: D, E
- - A client asking for F is directed to C3
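A minimal sketch of the lookup, using exact key sets as summaries for readability. Practical summary caches typically compress the summaries (for example with Bloom filters), which this sketch does not attempt. The class and method names are hypothetical.

```python
class SummaryCache:
    """A cache that also keeps a summary of what every peer holds."""

    def __init__(self, name, content):
        self.name = name
        self.content = dict(content)   # key -> cached object
        self.summaries = {}            # peer name -> set of keys held by that peer

    def learn_peer(self, peer):
        # Exact key sets stand in for compressed summaries (an assumption).
        self.summaries[peer.name] = set(peer.content)

    def locate(self, key):
        """Return the name of the cache holding key, or None if nobody has it."""
        if key in self.content:
            return self.name
        for peer_name, keys in self.summaries.items():
            if key in keys:
                return peer_name
        return None

c1 = SummaryCache("C1", {"A": 1, "B": 2, "C": 3})
c2 = SummaryCache("C2", {"D": 4, "E": 5})
c3 = SummaryCache("C3", {"F": 6, "G": 7})
for cache in (c1, c2, c3):
    for peer in (c1, c2, c3):
        if peer is not cache:
            cache.learn_peer(peer)
```

Any cache can answer "who holds F?" without contacting its peers, at the cost of every cache storing and refreshing a summary of all the others.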
53 Distributed hashing
- Distribute the work amongst many agents
- Efficient, O(1), determination of the agent responsible for a given KW or URL
- Problem: redistribution of data when the number of agents changes
- Solution: consistent hash?
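The redistribution problem is easy to see with the simplest distributed hash, hash modulo the number of agents. The slide does not name a specific scheme, so modulo hashing is an assumption here, as are the function names; small integers stand in for real hash values to keep the illustration reproducible.

```python
import hashlib

def stable_hash(key):
    """Hash a keyword or URL to a non-negative integer (md5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def responsible_agent(key_hash, n_agents):
    """O(1) lookup: the agent index is the hash modulo the number of agents."""
    return key_hash % n_agents

def moved_keys(key_hashes, old_n, new_n):
    """Keys whose responsible agent changes when the agent count changes."""
    return [h for h in key_hashes
            if responsible_agent(h, old_n) != responsible_agent(h, new_n)]

hashes = list(range(12))           # stand-in hash values
moved = moved_keys(hashes, 2, 3)   # grow from 2 agents to 3
```

Going from 2 to 3 agents reassigns 8 of the 12 keys, although an ideal scheme would only need to move about a third of them to the new agent.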
54 Consistent Hashing
- Use a standard hash function H to map items 1,2,3,4,5 and agents A,B to a unit circle
- Map each item to the closest cache
- - A holds 1,2,3
- - B holds 4,5
[4] Web Caching with Consistent Hashing, by David Karger et al., MIT Lab
55 Consistent Hashing
- To add a new agent C, hash the agent id
- Move the items close to it
- Other items don't move
- - A holds 3
- - B holds 4,5
- - C holds 1,2
- this example will be reused in partition dynamic
56 Consistent Hashing
- Designed features
- - Load balancing: each bucket stores roughly the same # of pages
- - Content look up: easily locate a given key by the hash function H
- - Smoothness: little impact on hash bucket contents when buckets are added/removed
- Application of consistent hashing
- - Freenet [6], UbiCrawler [5]
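A minimal consistent-hashing sketch under stated assumptions: md5 plays the role of H, a 128-bit integer ring stands in for the unit circle, and "closest" is read as the first agent clockwise from the item. The class name, method names and site names are hypothetical, and virtual nodes, which real systems add for better load balance, are left out.

```python
import bisect
import hashlib

def ring_position(name):
    """Map an agent id or a key to a point on the ring (md5 is an assumption)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, agents=()):
        self._points = []    # sorted ring positions of the agents
        self._agents = {}    # ring position -> agent id
        for agent in agents:
            self.add_agent(agent)

    def add_agent(self, agent):
        point = ring_position(agent)
        bisect.insort(self._points, point)
        self._agents[point] = agent

    def remove_agent(self, agent):
        point = ring_position(agent)
        self._points.remove(point)
        del self._agents[point]

    def owner(self, key):
        # First agent clockwise from the key, wrapping around the ring.
        index = bisect.bisect_left(self._points, ring_position(key))
        return self._agents[self._points[index % len(self._points)]]

sites = ["site%d.example" % i for i in range(1, 6)]   # hypothetical site names
ring = ConsistentHashRing(["A", "B"])
before = {site: ring.owner(site) for site in sites}
ring.add_agent("C")
after = {site: ring.owner(site) for site in sites}
```

Comparing `before` and `after` shows the smoothness property: a site either keeps its agent or moves to the new agent C, never from A to B or from B to A.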
57 Partition dynamic
- Suppose that in the above example items 1,2,3,4,5 are the site names of scheduled URLs. At first only agents A and B explore the web, partitioned like
- - Agent_A: 1,2,3
- - Agent_B: 4,5
[Figure: the Web divided into sites 1 to 5.]
58 Partition dynamic
- After adding new agent C, reassign the web domain to agents like
- - Agent_A: 3
- - Agent_B: 4,5
- - Agent_C: 1,2
[Figure: the Web divided into sites 1 to 5.]
59 Concrete model
- Multiagent layer
- - a general agent paradigm is not practical
- Agent type
- - Interface agent
- - Collector agent
- - Information agent
- Agent functionality
- - interface agent: interactively collects query information with the user
- - collector agent: collects information, forms the plan, composes the results
- - information agent: focused crawling with the plan, forms indexed files
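60 Concrete model
[Figure: layered, collaborative architecture. Users (User1 ... User n) exchange queries and answers with interface agents (InterfaceAgent1 ... InterfaceAgent k), which work with collector agents (Collector Agent1 ... Collector Agent j) and information agents (infoAgent1 ... infoAgent m), each backed by a database (Database1 ... Database m).]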
61 Concrete model
[Figure: infoAgent_1 ... infoAgent_n communicate with each other; each contains a crawler (fetching from the Web), a document processor, indexing, local storage and a KB.]
62 References
- [1] How a Search Engine Works, by Elizabeth Liddy, School of Information Studies, Syracuse University. http://www.infotoday.com/searcher/may01/liddy.htm
- [2] Parallel Crawlers, by Junghoo Cho and Hector Garcia-Molina. http://dbpubs.stanford.edu:8090/pub/2002-9
- [3] Mercator: A Scalable, Extensible Web Crawler. http://research.compaq.com/SRC/mercator/papers/www/paper.html
- [4] David Karger, Tom Leighton, Danny Lewin, and Alex Sherman. Web caching with consistent hashing. In Proc. of 8th International World Wide Web Conference, Toronto, Canada, 1999
- [5] UbiCrawler: A Scalable Fully Distributed Web Crawler (2003). http://ausweb.scu.edu.au/aw02/papers/refereed/vigna/paper.html
- [6] Freenet: A Distributed Anonymous Information Storage and Retrieval System. http://citeseer.nj.nec.com/clarke00freenet.html
63
- [7] Looking Up Data in P2P Systems, by Hari Balakrishnan et al. http://www.utsc.utoronto.ca/rosselet/cscd58s/tut03/pres03/p2p-lookups.pdf
- [8] Web Caching, by Ion Stoica. www.cs.berkeley.edu/istoica/cs268/notes/lecture21.pdf
- [9] The Effects of Cooperation on Multiagent Search in Task-Oriented Domains. http://citeseer.nj.nec.com/557884.html
- [10] Collaborative Search and Retrieval: Finding Information Together. https://doc.telin.nl/dscgi/ds.py/Get/File-8269/GigaCE-Collaborative_Search_and_Retrieval__Finding_Information_Together.pdf
- [11] Topic-Oriented Collaborative Crawling, by Chiasen Chung and Charles L.A. Clarke. http://citeseer.nj.nec.com/538331.html
- [12] Intelligent Crawling on the World Wide Web with Arbitrary Predicates (2001). http://citeseer.nj.nec.com/aggarwal01intelligent.html
- [13] Scaling Question Answering to the Web, by Cody Kwok et al. http://www10.org/cdrom/papers/120/
64
- [14] The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998). http://citeseer.nj.nec.com/brin98anatomy.html
- [15] Design and evaluation of a multi-agent collaborative Web Mining System (2003). http://citeseer.nj.nec.com/chau03design.html
- [16] Text-learning and related intelligent agents (1999), by Dunja Mladenic. http://citeseer.nj.nec.com/mladenic99textlearning.html
- [17] Text learning and related intelligent agents, by Dunja Mladenic. http://www.cs.cmu.edu/TextLeauning/pww/
- [18] Coordination of Multiple Intelligent Software agents, by Sycara, K., and Zeng, D. http://www.cs.cmu.edu/softagents/publications.html
- [19] Enhancing Collaborative Work through Agents, by F. Bergenti et al. http://www-dii.ing.unisi.it/aiia2002/paper/AGENTI/bergenti-aiia02.pdf
- [20] Agents that Reduce Work and Information Overload, by Pattie Maes. http://www.cs.brandeis.edu/cs125a/content/agentsmaes.doc
- [21] Collaboratively Searching the Web An Initial, by Agustin Schapira. http://none.cs.umass.edu/schapira/thesis/report/
65
- [22] Collaborative Information Agents on the World Wide Web, by James R. Chen. http://ic.arc.nasa.gov/ic/projects/aim/papers/dl98.pdf
- [23] Collaborative browsing and visualisation of the search process. http://www.comp.lancs.ac.uk/computing/research/cseg/projects/ariadne/docs/elvira96.html
- [24] Collaborative design that used the shared cognitive space. www.jaist.ac.jp/library/thesis/is-master-2002/paper/t-kizaki/abstract.ps
- [25] Collaborative Filtering, by Berkeley Workshop. http://www.sims.berkeley.edu/resources/collab/
66 Thanks!