Effective Resource Discovery on the World-Wide Web

1
Effective Resource Discovery on the World-Wide Web

Athanasios E. Papathanasiou
M. Sc. Thesis,
Computer Science Department
University of Crete - Heraklion
This work is supported by the USENIX Association

2
Effective Information Search on the World-Wide Web

Athanasios E. Papathanasiou
M. Sc. Thesis,
Computer Science Department
University of Crete - Heraklion
This work is supported by the USENIX Association

3
Layout

Intro
Information Overloading Problem
USEwebNET
Design - User Interface - Implementation
Example of Use
Extensions
PaperFinder
Design - User Interface
Keyword-Based and Resource Discovery Modes
Experimental Results
Conclusions

4
Information Explosion

Terabytes of Information (May 99)
ALTA-VISTA: 150 million URLs
Northern Light: 140 million URLs
INKTOMI: 110 million URLs
The above is a tiny fraction of the WWW:
more than 320 million pages
non-HTML information: images, PostScript, ...
dynamic HTML pages
on-line shopping
HTML gateways to databases

5
More...

Web Size doubles every 12-18 months
searchenginewatch.internet.com
How big is the web?
www.neci.nj.nec.com/homepages/lawrence/websize.html

6
How much data is too much?

Alta-Vista returns:
61,000 hits about "distributed systems"
9,000,000 hits about "java"
It takes 3 to 4 months to browse them at a rate of
one URL per second, 24 hours/day
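
The browsing-time estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the 30-day month and the function name are assumptions made for illustration, not part of the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a month is taken as 30 days


def browsing_days(hits, seconds_per_url=1.0):
    """Days needed to look at every hit, browsing around the clock."""
    return hits * seconds_per_url / SECONDS_PER_DAY


java_hits = 9_000_000  # hit count quoted on the slide for "java"
days = browsing_days(java_hits)
months = days / DAYS_PER_MONTH
print(f"{days:.0f} days, about {months:.1f} months")
```

Even at one URL per second with no breaks, a single popular query costs months of reading, which is the point the slide makes.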

7
How much data is too much? (2)

To make matters worse
Even if one manages to read all the above
the next time (s)he asks the same query
it is the same data flood all over again
Ranking does NOT help:
synonymy (car - automobile)
polysemy (jaguar: car or jungle cat)
spamming...

8
Needle in a haystack?

Information Retrieval is like
looking for a needle in a haystack
Solution
Use the right tools
X-ray glasses

9
USEwebNET: A Tool for Effective Resource
Discovery on the World-Wide Web

Design - User Interface
Implementation
Example of use
Extensions

10
Traditional Model of Search

1. Send a query to a Search Engine
2. Receive thousands of URLs
3. Spend a few hours searching
4. Do some work
5. Go to 1.
Single-shot search paradigm
no memory of previous searches/findings
We need Research tools

11
Search becomes Research

Human Tasks:
Browse information and
discover knowledge
Computer Tasks:
Repeatedly query search engines
show humans new information
hide previously-seen information

12
Re-searching the Web

Users define topics of interest (cf. Newsgroups):
Java Programming
Digital Libraries
Computers:
repeatedly query search engines
present new information to users
present updated information
recognize modifications
remember what URLs have been visited
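
The computer's side of this loop can be sketched as follows. This is only an illustration under stated assumptions: the `Topic` class, the `research` function and the example URLs are hypothetical, search results are passed in as a plain dictionary instead of being fetched from a real search engine, and a URL counts as "seen" as soon as it has been presented once.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A user-defined topic of interest (hypothetical structure)."""
    query: str
    seen: dict = field(default_factory=dict)  # URL -> content hash when last presented


def digest(content):
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def research(topic, results):
    """Split search results (URL -> page content) into new and updated URLs.

    URLs already presented with unchanged content are hidden from the user.
    """
    new, updated = [], []
    for url, content in results.items():
        h = digest(content)
        if url not in topic.seen:
            new.append(url)
        elif topic.seen[url] != h:
            updated.append(url)
        topic.seen[url] = h  # remember what has been presented
    return new, updated


topic = Topic("Java Programming")
first = research(topic, {"http://a.example/": "v1", "http://b.example/": "v1"})
second = research(topic, {"http://a.example/": "v1",
                          "http://b.example/": "v2",
                          "http://c.example/": "v1"})
print(first)
print(second)
```

On the second run only the changed page and the page that had not been returned before are reported; the unchanged one stays hidden.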

13
USENET News

Computers:
periodically gather recent news articles
remember which articles have been read
Humans:
define newsgroups
write and publish articles
read only recent articles
stay up-to-date in a field

14
USENET + web = USEwebNET

User Queries become Newsgroups
distributed systems
remote memory paging

15
USENET + web = USEwebNET

Users' interests become newsgroups
remote memory
URLs become published news articles

16
User Interface Design

17
Back-End Engine

Result Fetcher and Parser
Database update
E-Mail Notification
Crontab Manager
Difference Engine:
AT&T Labs htmldiff
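
What a difference engine does can be illustrated with Python's standard `difflib` module. This is a stand-in chosen for the sketch, not the htmldiff tool named on the slide, and it compares pages line by line without any understanding of HTML; the function name is hypothetical.

```python
import difflib


def changed_lines(old_page, new_page):
    """Return the lines removed ('- ') or added ('+ ') between two page versions."""
    diff = difflib.ndiff(old_page.splitlines(), new_page.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


old = "<h1>News</h1>\n<p>old item</p>"
new = "<h1>News</h1>\n<p>new item</p>\n<p>extra</p>"
for line in changed_lines(old, new):
    print(line)
```

An empty result means the stored version and the freshly fetched version are identical, so the URL does not need to be shown to the user again.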

18
Back-End Engine Architecture

Check and compare versions

19
Example of USE

Define a preference

20
Example of USE

All user preferences
Results

21
Example of USE

Reading/Saving URLs

22
A complete screenshot

23
Extensions

Digital Libraries:
"keep me posted about papers on Java"
E-commerce:
"let me know about 1967 T-birds"
Keeping up-to-date:
"let me know if something new about Evangelos Markatos is published on the web"

24
PaperFinder: A Tool for Scalable Search of
Digital Libraries

Design - User Interface
Keyword Based Mode
Resource Discovery Mode
Experimental Results

25
Keeping up-to-date

Scientists want to stay up-to-date
they go to conferences
they subscribe to journals
they go to libraries
Being up-to-date is difficult
too many conferences/journals
If only there were a tool to find and deliver
interesting papers.

26
PaperFinder

PaperFinder finds and reports interesting
papers
Operates on top of popular DLs
Keyword search mode:
Finds papers on "distributed systems"
Resource Discovery Mode:
Finds papers related to a seed paper

27
PaperFinder: Keyword Search

User defines a keyword search
e.g. find papers containing the term "distributed systems"
PaperFinder queries Digital Libraries
it returns relevant papers
it remembers papers read so far
User invokes PaperFinder to find out what's new

28
PaperFinder: Keyword Search

29
Resource Discovery

User specifies a seed paper
PaperFinder finds related papers
Query generalization:
create new queries that return related papers
Result filtering:
sort the results according to relevance

30
Query Generalization: How?

Keyword extraction from titles/abstract/text.
Semantic network: find synonyms.
Princeton University
Bibliographic references.
CiteSeer
Use of subject descriptors.
ACM - PaperFinder
Seed authors
PaperFinder
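
Three of the techniques listed above (keyword extraction from the title, synonym lookup, and seed authors) can be sketched as follows. Everything here is an illustrative assumption: the stopword list and the tiny synonym table stand in for a real semantic network, and the `author:` query prefix is a made-up syntax, not that of any particular digital library.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Hypothetical synonym table; a real system would consult a semantic network.
SYNONYMS = {"file": ["storage"]}


def extract_keywords(title):
    """Lower-case words of the title, minus stopwords."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def generalize(seed):
    """Create several queries from one seed paper."""
    keywords = extract_keywords(seed["title"])
    queries = [" ".join(keywords)]
    # One extra query per synonym substitution.
    for i, word in enumerate(keywords):
        for synonym in SYNONYMS.get(word, []):
            queries.append(" ".join(keywords[:i] + [synonym] + keywords[i + 1:]))
    # One query per seed author (hypothetical query syntax).
    queries += [f"author:{name}" for name in seed["authors"]]
    return queries


seed = {"title": "Serverless Network File Systems",
        "authors": ["T. Anderson", "M. Dahlin"]}
for query in generalize(seed):
    print(query)
```

Each generated query would be sent to the digital libraries, and the merged results passed on to the filtering step.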

31
Resource filtering

Filtering papers according to relevance:
sort papers according to their author distance
from the seed paper
various other metrics could be possible
cf. Erdos Number:
Erdos himself has Erdos Number 0
Co-authors of Erdos have Erdos Number 1
People who have written a paper with a co-author
of Erdos have Erdos Number 2
Similarity Metric:
Avg (min/max) of author distances
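
Author distance can be computed with a breadth-first search over the co-authorship graph. The sketch below is one possible reading of the similarity metric, assumed for illustration: for each author of a candidate paper take the distance to the nearest seed author, then average. In this sketch an author is at distance 0 from themselves and a direct co-author is at distance 1. All function names and the sample data are hypothetical.

```python
from collections import deque
from itertools import combinations


def coauthor_graph(papers):
    """Build an undirected graph linking authors who share a paper."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def author_distance(graph, source, target):
    """Number of co-authorship links between two authors, or None if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        author, dist = queue.popleft()
        for neighbour in graph.get(author, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def paper_distance(graph, seed_authors, paper_authors):
    """Average, over the candidate's authors, of the distance to the nearest seed author."""
    nearest = []
    for author in paper_authors:
        dists = [author_distance(graph, s, author) for s in seed_authors]
        dists = [d for d in dists if d is not None]
        if dists:
            nearest.append(min(dists))
    return sum(nearest) / len(nearest) if nearest else None


graph = coauthor_graph([["A", "B"], ["B", "C"], ["C", "D"]])
print(author_distance(graph, "A", "D"))
print(paper_distance(graph, ["A"], ["B", "C"]))
```

Candidate papers can then be sorted by `paper_distance`, smallest first, to put the work of closely collaborating authors at the top.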

32
First Experimental Results

Filter: at least as good as ACM.
Room for improvement: other distance metrics
Identify the work of distinctive research groups
Idea: clusters of cooperating research groups
exploit the links among them
reveal new articles
that other methods couldn't discover

33
Other distance metrics

Weighted Author Distance:
No. of papers they have co-authored over the
total number of papers they have authored
Keyword Distance:
Given two papers and a set of keywords
Paper distance = No. of keywords in both papers
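
Both metrics reduce to simple set operations. The sketch below makes two assumptions that the slide does not settle: "total number of papers they have authored" is taken to be the union of the two authors' papers, and a keyword counts as present when it appears as a whole word in the paper's text. The function names are hypothetical.

```python
def coauthorship_ratio(papers_a, papers_b):
    """Papers two authors wrote together over all papers either of them wrote."""
    total = papers_a | papers_b  # assumption: "total" means the union
    if not total:
        return 0.0
    return len(papers_a & papers_b) / len(total)


def shared_keywords(text_a, text_b, keywords):
    """Number of the given keywords that occur in both papers."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return sum(1 for k in keywords
               if k.lower() in words_a and k.lower() in words_b)


print(coauthorship_ratio({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))
print(shared_keywords("remote memory paging over networks",
                      "paging in distributed memory systems",
                      {"memory", "paging", "network"}))
```

Under these definitions a larger value means a closer relation, so either value would have to be inverted before being used as a distance for sorting.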

34
Current database

9,000 bibliographic references
10,000 authors
100,000,000 author pairs
max distance 20
distant relations among authors

35
Experiment 1

Very relevant: PaperFinder 7/20 - ACM 5/20
Not relevant: PaperFinder 8/20 - ACM 11/20
Seed: T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, R. Wang. Serverless Network File Systems. ACM Trans. on Comp. Systems, Feb. 1996.
Topic: distributed OS, distributed file systems, reliability, availability

36
Experiment 2

Identifying clusters of cooperating research groups.
Seed: E. M. Chaves, T. J. LeBlanc, B. D. Marsh, M. L. Scott. Kernel-Kernel Communication in a Shared-Memory Multiprocessor. Symposium on Experiences with Distributed and Multiprocessor Systems.

37
Summary-Conclusions