Title: Effective Resource Discovery on the World-Wide Web
1Effective Resource Discovery on the World-Wide
Web
- Athanasios E. Papathanasiou
- M. Sc. Thesis,
- Computer Science Department
- University of Crete - Heraklion
- This work is supported by the USENIX Association
2Effective Information Search on the World-Wide Web
- Athanasios E. Papathanasiou
- M.Sc. Thesis
- Computer Science Department
- University of Crete - Heraklion
- This work is supported by the USENIX Association
3Layout
- Intro
- Information Overloading Problem
- USEwebNET
- Design - User Interface - Implementation
- Example of Use
- Extensions
- PaperFinder
- Design - User Interface
- Keyword-based and Resource Discovery Mode
- Experimental Results
- Conclusions
4Information Explosion
- Terabytes of information (May '99)
- Alta-Vista: 150 million URLs
- Northern Light: 140 million URLs
- Inktomi: 110 million URLs
- The above is a tiny fraction of the WWW
- more than 320 million pages
- non-HTML information: images, PostScript, ...
- dynamic HTML pages
- on-line shopping
- HTML gateways to databases
5More...
- Web Size doubles every 12-18 months
- searchenginewatch.internet.com
- How big is the web?
- www.neci.nj.nec.com/homepages/lawrence/websize.html
6How much data is too much?
- Alta-Vista returns
- 61,000 hits about distributed systems
- 9,000,000 hits about java
- Browsing the Java hits alone takes about 104 days (roughly 3.5 months) at a rate of one URL per second, 24 hours/day
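The browsing-time estimate can be checked with a few lines of arithmetic. The sketch below assumes exactly one URL per second with no breaks; the function name `browsing_days` is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # browsing around the clock


def browsing_days(hits: int, seconds_per_url: float = 1.0) -> float:
    """Days needed to visit every hit at a fixed rate, 24 hours/day."""
    return hits * seconds_per_url / SECONDS_PER_DAY


java_days = browsing_days(9_000_000)   # hits about java
ds_hours = browsing_days(61_000) * 24  # hits about distributed systems

print(f"java: {java_days:.1f} days")                  # java: 104.2 days
print(f"distributed systems: {ds_hours:.1f} hours")   # distributed systems: 16.9 hours
```

Even the smaller result set costs most of a day of nonstop reading, which is the point of the slide.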
7How much data is too much? (2)
- To make matters worse
- Even if one manages to read all the above
- the next time (s)he asks the same query
- it is the same data flood all over again
- Ranking does NOT help.
- synonymy (car - automobile)
- polysemy (jaguar car or jungle cat)
- spamming...
8Needle in a haystack?
- Information Retrieval is like
- looking for a needle in a haystack
- Solution
- Use the right tools
- X-ray glasses
9USEwebNET: A Tool for Effective Resource Discovery on the World-Wide Web
- Design - User Interface
- Implementation
- Example of use
- Extensions
10Traditional Model of Search
- 1. Send a query to a Search Engine
- 2. Receive thousands of URLs
- 3. Spend a few hours searching
- 4. Do some work
- 5. Go to 1.
- Single-shot search paradigm
- no memory of previous searches/findings
- We need Research tools
11Search becomes Research
- Human Tasks
- Browse Information and
- Discover Knowledge
- Computer Tasks
- Repeatedly query search engines
- show humans new information
- hide previously-seen information
12Re-searching the Web
- Users define topics of interest (cf. Newsgroups)
- Java Programming
- Digital Libraries
- Computers
- repeatedly query search engines
- present new information to users
- present updated information
- recognize modifications
- remember what URLs have been visited
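The re-searching loop above can be sketched as a small per-topic memory. This is a minimal illustration, not the USEwebNET implementation: the class `TopicMemory`, its `classify` method, and the use of a SHA-256 content fingerprint to recognize modifications are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopicMemory:
    """Remembers, for one topic, a content fingerprint of every URL already shown."""
    seen: Dict[str, str] = field(default_factory=dict)  # url -> content hash

    def classify(self, results: Dict[str, str]) -> Dict[str, List[str]]:
        """Split fresh search results (url -> page text) into new, updated, unchanged."""
        report: Dict[str, List[str]] = {"new": [], "updated": [], "unchanged": []}
        for url, content in results.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            previous = self.seen.get(url)
            if previous is None:
                report["new"].append(url)        # never shown before
            elif previous != digest:
                report["updated"].append(url)    # shown before, content changed
            else:
                report["unchanged"].append(url)  # can be hidden from the user
            self.seen[url] = digest
        return report


memory = TopicMemory()
first = memory.classify({"http://a.example/": "v1", "http://b.example/": "v1"})
second = memory.classify(
    {"http://a.example/": "v1", "http://b.example/": "v2", "http://c.example/": "v1"}
)
```

On the second run only `http://c.example/` is reported as new and `http://b.example/` as updated; the unchanged URL is the one a user would not need to see again.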
13USENET News
- Computers
- periodically gather recent news articles
- remember which articles have been read
- Humans
- define newsgroups
- write and publish articles
- read only recent articles
- staying up-to-date in a field
14USENET + web = USEwebNET
- User Queries become Newsgroups
- distributed systems
- remote memory paging
15USENET + web = USEwebNET
- Users' interests become newsgroups
- remote memory
- URLs become published news articles
16User Interface Design
17Back-End Engine
- Result Fetcher & Parser
- Database update
- E-Mail Notification
- Crontab Manager
- Difference Engine
- AT&T Labs htmldiff
18Back-End Engine Architecture
Check and compare versions
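The "check and compare versions" step can be illustrated with a plain line-based diff. The talk names AT&T Labs htmldiff as the difference engine; the sketch below swaps in Python's standard `difflib` instead, so it only approximates the idea and ignores HTML structure. The function name `changed_lines` is hypothetical.

```python
import difflib
from typing import List


def changed_lines(old_page: str, new_page: str) -> List[str]:
    """Return only the lines added ('+ ') or removed ('- ') between two page versions."""
    diff = difflib.ndiff(old_page.splitlines(), new_page.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; keep real changes only.
    return [line for line in diff if line.startswith(("+ ", "- "))]


old = "<h1>News</h1>\n<p>old item</p>"
new = "<h1>News</h1>\n<p>old item</p>\n<p>new item</p>"
print(changed_lines(old, new))  # ['+ <p>new item</p>']
```

An empty result means the stored version and the freshly fetched one are identical, so no notification would be needed.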
19Example of USE
20Example of USE
- All user preferences
- Results
21Example of USE
22A complete screenshot
23Extensions
- Digital Libraries
- keep me posted about papers on Java
- E-commerce
- let me know about 1967 T-Birds
- Keeping up-to-date
- let me know if something new about Evangelos
Markatos is published on the web
24PaperFinder: A Tool for Scalable Search of Digital Libraries
- Design - User Interface
- Keyword Based Mode
- Resource Discovery Mode
- Experimental Results
25Keeping up-to-date
- Scientists want to stay up-to-date
- they go to conferences
- they subscribe to journals
- they go to libraries
- Being up-to-date is difficult
- too many conferences/journals
- If only there were a tool to find and deliver
interesting papers.
26PaperFinder
- PaperFinder finds and reports interesting papers
- Operates on top of popular DLs
- Keyword search mode
- Finds papers for distributed systems
- Resource Discovery Mode
- Finds papers related to a seed paper
27PaperFinder Keyword Search
- User defines keyword search
- e.g. find papers containing the term "distributed systems"
- PaperFinder queries Digital Libraries
- it returns relevant papers
- it remembers papers read so far
- User invokes PaperFinder to find out what's new
28PaperFinder Keyword Search
29Resource Discovery
- User specifies seed paper
- Paperfinder finds related papers
- Query generalization
- create new queries that return related papers
- Result filtering
- sort the results according to relevance
30Query Generalization: How?
- Keyword extraction from titles/abstract/text.
- Semantic network: find synonyms.
- Princeton University
- Bibliographic references.
- CiteSeer
- Use of subject descriptors.
- ACM
- PaperFinder
- Seed authors
- PaperFinder
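Two of the generalization steps listed above, keyword extraction from a title and synonym expansion, can be sketched as follows. This is an illustration under stated assumptions: the stopword list and the tiny `SYNONYMS` table are made up for the example and merely stand in for a real semantic network lookup; `extract_keywords` and `generalize` are hypothetical names.

```python
import re
from typing import Dict, List

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Hypothetical stand-in for a semantic-network lookup (illustrative entries only).
SYNONYMS: Dict[str, List[str]] = {
    "serverless": ["decentralized"],
    "file": ["storage"],
}


def extract_keywords(title: str) -> List[str]:
    """Lower-case the title and drop stopwords, keeping word order."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def generalize(title: str) -> List[str]:
    """Return the seed query plus one extra query per single-synonym substitution."""
    keywords = extract_keywords(title)
    queries = [" ".join(keywords)]
    for i, word in enumerate(keywords):
        for synonym in SYNONYMS.get(word, []):
            variant = keywords[:i] + [synonym] + keywords[i + 1:]
            queries.append(" ".join(variant))
    return queries


queries = generalize("Serverless Network File Systems")
```

Each generated query would then be sent to the digital libraries, widening the set of candidate papers before filtering.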
31Resource filtering
- Filtering papers according to relevance
- sort papers according to their author distance from the seed paper
- various other metrics are possible
- cf. Erdős number
- Co-authors of Erdős have Erdős number 1
- People who have written a paper with a co-author of Erdős have Erdős number 2
- Similarity Metric
- Avg (or min/max) of author distances
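Author distance can be sketched as a shortest-path search over a co-authorship graph. The code below is one possible reading, not the PaperFinder implementation: it assumes direct co-authors are at distance 1, and it assumes the paper-level score is the average, over the candidate paper's authors, of each author's minimum distance to any seed author. Author names are placeholders.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional, Set


def build_coauthor_graph(papers: List[List[str]]) -> Dict[str, Set[str]]:
    """Link every pair of authors that appear together on a paper."""
    graph: Dict[str, Set[str]] = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def author_distance(graph: Dict[str, Set[str]], source: str, target: str) -> Optional[int]:
    """Breadth-first search: number of co-authorship hops, or None if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        author, hops = queue.popleft()
        for neighbor in graph.get(author, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


def paper_distance(graph, seed_authors, candidate_authors) -> Optional[float]:
    """Average of each candidate author's minimum distance to any seed author."""
    nearest = []
    for candidate in candidate_authors:
        hops = [author_distance(graph, seed, candidate) for seed in seed_authors]
        reachable = [h for h in hops if h is not None]
        if reachable:
            nearest.append(min(reachable))
    return sum(nearest) / len(nearest) if nearest else None


graph = build_coauthor_graph([["A", "B"], ["B", "C"], ["C", "D"]])
score = paper_distance(graph, ["A"], ["B", "D"])  # (1 + 3) / 2
```

Sorting candidate papers by this score in ascending order puts papers by close collaborators of the seed authors first.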
32First Experimental Results
- Filter: at least as good as ACM.
- Room for improvement: other distance metrics
- Identify the work of distinctive research groups
- Idea: clusters of cooperating research groups
- exploit links among them
- reveal new articles
- that other methods couldn't discover
33Other distance metrics
- Weighted Author Distance
- No. of papers they have co-authored over the total number of papers they have authored
- Keyword Distance
- Given two papers and a set of keywords
- Paper distance = number of keywords that appear in both papers
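Both metrics above reduce to simple set operations. The sketch below is hedged: the slide does not say whether the "total number of papers" counts papers by either author or by both, so the union of the two authors' papers is assumed here; the function names and sample data are illustrative only.

```python
from typing import Iterable, Set


def weighted_author_distance(papers_a: Set[str], papers_b: Set[str]) -> float:
    """Co-authored papers over all papers by either author (assumed reading)."""
    total = papers_a | papers_b
    return len(papers_a & papers_b) / len(total) if total else 0.0


def keyword_overlap(text_a: str, text_b: str, keywords: Iterable[str]) -> int:
    """Number of the given keywords that occur in both papers' text."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return sum(1 for k in keywords if k in words_a and k in words_b)


weight = weighted_author_distance({"p1", "p2", "p3"}, {"p2", "p3", "p4"})  # 2 / 4
shared = keyword_overlap(
    "remote memory paging",
    "distributed remote memory",
    ["remote", "memory", "paging"],
)  # "remote" and "memory" appear in both
```

Note that both values grow as the authors or papers become more closely related, so a ranking would sort them in descending order.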
34Current database
- 9,000 bibliographic references
- 10,000 authors
- 100,000,000 author pairs (10,000 × 10,000)
- max distance 20
- distant relations among authors
35Experiment 1
- Very relevant: PaperFinder 7/20 - ACM 5/20
- Not relevant: PaperFinder 8/20 - ACM 11/20
- Seed: T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, R. Wang. Serverless Network File Systems. ACM Trans. on Comp. Systems, Feb. 1996.
- Topic: distributed OS, distributed file systems, reliability, availability
36Experiment 2
- Identifying clusters of cooperating research groups
- Seed: E. M. Chaves, T. J. LeBlanc, B. D. Marsh, M. L. Scott. Kernel-Kernel Communication in a Shared-Memory Multiprocessor. Symposium on Experiences with Distributed and Multiprocessor Systems.
37Summary-Conclusions
- Searching needs good tools
- Researching needs even better
- USEwebNET combines
- the information wealth of the web
- with the user-interface of USENET
- http://cuckoo.ics.forth.gr:9002
38Summary-Conclusions
- PaperFinder keeps scientists up-to-date in their field
- PaperFinder combines
- known keyword-based query mode
- Resource-discovery mode
- memory to remember papers read
- friendly USENET-like interface
39Summary-Conclusions
- USEwebNET and PaperFinder
- Popular user-interface of USENET News
- What's new?
- Hide known information
- Offload busy web servers
- PaperFinder
- Filtering: presents most interesting first
- The filter evolves
- as new knowledge becomes available
40Related Work (USEwebNET)
- ARCHIE (ftp), Veronica (gopher)
- Netfind (Colorado)
- Search engines (Alta-Vista, ...)
- Meta-searching (MetaCrawler)
- The Informant (Dartmouth)
- SenseMaker (Stanford)
- Letizia (MIT)
41Related Work (PaperFinder)
- CiteSeer (NEC)
- Dienst
- ACM