Effective Resource Discovery on the World-Wide Web

Slides: 42
Provided by: thanospapa

Transcript and Presenter's Notes


1
Effective Resource Discovery on the World-Wide Web
  • Athanasios E. Papathanasiou
  • M. Sc. Thesis,
  • Computer Science Department
  • University of Crete - Heraklion
  • This work is supported by the USENIX Association

2
Effective Information Search on the World-Wide Web
  • Athanasios E. Papathanasiou
  • Master's Thesis
  • Department of Computer Science
  • University of Crete - Heraklion
  • This work is supported by the USENIX Association

3
Layout
  • Intro
  • Information Overloading Problem
  • USEwebNET
  • Design - User Interface - Implementation
  • Example of Use
  • Extensions
  • PaperFinder
  • Design - User Interface
  • Keyword based and Resource Discovery Mode
  • Experimental Results
  • Conclusions

4
Information Explosion
  • Terabytes of Information (May 99)
  • ALTA-VISTA: 150 million URLs
  • Northern Light: 140 million URLs
  • INKTOMI: 110 million URLs
  • The above is a tiny fraction of the WWW
  • more than 320 million pages.
  • non-HTML information: images, PostScript, ...
  • dynamic html pages
  • on-line shopping
  • html gateways to databases

5
More...
  • Web Size doubles every 12-18 months
  • searchenginewatch.internet.com
  • How big is the web?
  • www.neci.nj.nec.com/homepages/lawrence/websize.html

6
How much data is too much?
  • Alta-Vista returns:
  • 61,000 hits about 'distributed systems'
  • 9,000,000 hits about 'java'
  • Browsing the 'java' hits alone takes about 3.5 months at a rate of
    one URL per second, 24 hours/day
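
The browsing-time estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name `browse_days` and the 30-day month are assumptions made for the illustration, not part of the presentation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def browse_days(hits: int, urls_per_second: float = 1.0) -> float:
    """Days needed to visit every hit when browsing around the clock."""
    return hits / urls_per_second / SECONDS_PER_DAY


# Hit counts quoted on the slide for the two Alta-Vista queries.
java_days = browse_days(9_000_000)
systems_hours = browse_days(61_000) * 24

# A month is approximated as 30 days (assumption).
print(f"java: {java_days:.1f} days, about {java_days / 30:.1f} months")
print(f"distributed systems: {systems_hours:.1f} hours")
```

At one URL per second the 'java' hits take a little over 104 days, which is between three and four months of uninterrupted browsing.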

7
How much data is too much? (2)
  • To make matters worse
  • Even if one manages to read all the above
  • the next time (s)he asks the same query
  • it is the same data flood all over again
  • Ranking does NOT help.
  • synonymy (car - automobile)
  • polysemy (jaguar car or jungle cat)
  • spamming...

8
Needle in a haystack?
  • Information Retrieval is like
  • looking for a needle in a haystack
  • Solution
  • Use the right tools
  • X-ray glasses

9
USEwebNET: A Tool for Effective Resource Discovery on the World-Wide Web
  • Design - User Interface
  • Implementation
  • Example of use
  • Extensions

10
Traditional Model of Search
  • 1. Send a query to a Search Engine
  • 2. Receive thousands of URLs
  • 3. Spend a few hours searching
  • 4. Do some work
  • 5. Go to 1.
  • Single-shot search paradigm
  • no memory of previous searches/findings
  • We need Research tools

11
Search becomes Research
  • Human Tasks
  • Browse Information and
  • Discover Knowledge
  • Computer Tasks
  • Repeatedly inquire search engines
  • show humans new information
  • hide previously-seen information

12
Re-searching the Web
  • Users define topics of interest (cf. Newsgroups)
  • Java Programming
  • Digital Libraries
  • Computers
  • repeatedly inquire search engines
  • present new information to users
  • present updated information
  • recognize modifications
  • remember what URLs have been visited
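
The bookkeeping listed above (present new information, recognize modifications, remember what has been seen) can be sketched as follows. This is a minimal illustration and not USEwebNET's actual code: the `Topic` class, its fields, and the use of a SHA-256 digest to detect modified pages are all assumptions, and the sketch treats every reported URL as seen.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One user-defined topic of interest, e.g. 'Digital Libraries'."""

    query: str
    # URL -> digest of the page content when it was last reported.
    seen: dict = field(default_factory=dict)

    def classify(self, results: dict) -> dict:
        """Split search results (URL -> page text) into new, updated, unchanged."""
        report = {"new": [], "updated": [], "unchanged": []}
        for url, text in results.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if url not in self.seen:
                report["new"].append(url)
            elif self.seen[url] != digest:
                report["updated"].append(url)
            else:
                report["unchanged"].append(url)
            # Remember this version so the next run can hide it.
            self.seen[url] = digest
        return report
```

Calling `classify` once per periodic query run means only the `new` and `updated` lists need to be shown to the user; the `unchanged` list is what gets hidden.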

13
USENET News
  • Computers
  • periodically gather recent news articles
  • remember which articles have been read
  • Humans
  • define newsgroups
  • write and publish articles
  • read only recent articles
  • staying up-to-date in a field

14
USENET + web = USEwebNET
  • User Queries become Newsgroups
  • distributed systems
  • remote memory paging

15
USENET + web = USEwebNET
  • Users' interests become newsgroups
  • remote memory
  • URLs become published news articles

16
User Interface Design

17
Back-End Engine
  • Result Fetcher & Parser
  • Database update
  • E-Mail Notification
  • Crontab Manager
  • Difference Engine
  • AT&T Labs htmldiff
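
The role of the difference engine can be illustrated with Python's standard `difflib` module. This is a stand-in for illustration only: the slide names AT&T Labs htmldiff as the tool actually used, and the function name `changed_lines` is hypothetical.

```python
import difflib


def changed_lines(old_page: str, new_page: str) -> list:
    """Return the lines added (+) or removed (-) between two page versions."""
    diff = list(
        difflib.unified_diff(
            old_page.splitlines(),
            new_page.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
    # The first two entries are the '---' and '+++' file headers; skip them,
    # then keep only added/removed lines (hunk headers start with '@@').
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

An empty result means the stored version and the freshly fetched version are identical, so the URL does not need to be flagged as updated.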

18
Back-End Engine Architecture
Check and compare versions

19
Example of USE
  • Define a preference

20
Example of USE
  • All user preferences
  • Results

21
Example of USE
  • Reading/Saving URLs

22
A complete screenshot

23
Extensions
  • Digital Libraries
  • keep me posted about papers on Java
  • E-commerce
  • let me know about 1967 Tbirds
  • Keeping up-to-date
  • let me know if something new about Evangelos
    Markatos is published on the web

24
PaperFinder: A Tool for Scalable Search of Digital Libraries
  • Design - User Interface
  • Keyword Based Mode
  • Resource Discovery Mode
  • Experimental Results

25
Keeping up-to-date
  • Scientists want to stay up-to-date
  • they go to conferences
  • they subscribe to journals
  • they go to libraries
  • Being up-to-date is difficult
  • too many conferences/journals
  • If only there were a tool to find and deliver
    interesting papers.

26
PaperFinder
  • Paperfinder finds and reports interesting
    papers
  • Operates on top of popular DLs
  • Keyword search mode
  • Finds papers for distributed systems
  • Resource Discovery Mode
  • Finds papers related to a seed paper

27
PaperFinder: Keyword Search
  • User defines keyword search
  • e.g. find papers containing the term 'distributed systems'
  • PaperFinder inquires Digital Libraries
  • it returns relevant papers
  • it remembers papers read so far
  • User invokes PaperFinder to find out what's new

28
PaperFinder Keyword Search

29
Resource Discovery
  • User specifies seed paper
  • Paperfinder finds related papers
  • Query generalization
  • create new queries that return related papers
  • Result filtering
  • sort the results according to relevance

30
Query Generalization: How?
  • Keyword extraction from titles/abstract/text.
  • Semantic network: find synonyms.
  • Princeton University
  • Bibliographic references.
  • CiteSeer
  • Use of subject descriptors.
  • ACM - PaperFinder
  • Seed authors
  • PaperFinder
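
Of the techniques listed, keyword extraction from titles is the simplest to sketch. The following is an illustration under stated assumptions, not PaperFinder's implementation: the stopword list, the function names `extract_keywords` and `generalize`, and the idea of pairing keywords into broader queries are all hypothetical.

```python
import re
from itertools import combinations

# A tiny illustrative stopword list (assumption, not from the presentation).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def extract_keywords(title: str) -> list:
    """Lower-case the title, split it into words, drop stopwords and repeats."""
    keywords = []
    for word in re.findall(r"[a-z]+", title.lower()):
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def generalize(title: str, size: int = 2) -> list:
    """Build broader queries from every combination of `size` title keywords."""
    return [" ".join(combo) for combo in combinations(extract_keywords(title), size)]
```

For a seed title such as "Serverless Network File Systems", `generalize` yields six two-word queries, each of which matches more papers than the full title would.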

31
Resource filtering
  • Filtering papers according to relevance
  • sort papers according to their author distance
    from the seed paper
  • various other metrics could be possible
  • cf. Erdos Number
  • Erdos himself has Erdos Number 0
  • Co-authors of Erdos have Erdos Number 1
  • People who have written a paper with a co-author
    of Erdos have Erdos Number 2
  • Similarity Metric
  • Avg(min/max) of author distances
32
First Experimental Results
  • Filter: at least as good as ACM.
  • Room for improvement: other distance metrics
  • Identify the work of distinctive research groups
  • Idea: clusters of cooperating research groups
  • exploit links among
  • reveal new articles
  • that other methods couldn't discover

33
Other distance metrics
  • Weighted Author Distance
  • No. of papers they have co-authored over the
    total number of papers they have authored
  • Keyword Distance
  • Given two papers and a set of keywords
  • Paper distance = No. of keywords in both papers
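
Both metrics can be sketched in a few lines. The slide does not say how the "total number of papers" of two authors is counted; the sketch assumes the union of their paper sets. The function names are hypothetical, and keywords are assumed to be single lower-case words. Note that for both measures a larger value means the items are more closely related.

```python
def coauthorship_ratio(papers_a: set, papers_b: set) -> float:
    """Co-authored papers divided by all papers written by either author."""
    total = papers_a | papers_b  # assumption: 'total' means the union
    return len(papers_a & papers_b) / len(total) if total else 0.0


def shared_keywords(text_a: str, text_b: str, keywords: set) -> int:
    """Count the keywords that occur in both papers' text."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return sum(1 for keyword in keywords if keyword in words_a and keyword in words_b)
```

Two authors with paper sets {p1, p2, p3} and {p2, p3, p4} get a ratio of 0.5, since they share two of four papers overall.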

34
Current database
  • 9,000 bibliographic references
  • 10,000 authors
  • 100,000,000 author pairs
  • max distance 20
  • distant relations among authors

35
Experiment 1
  • Very relevant: PaperFinder 7/20 - ACM 5/20
  • Not relevant: PaperFinder 8/20 - ACM 11/20
  • Seed: T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli,
    R. Wang. Serverless Network File Systems. ACM Trans. on Comp.
    Systems, Feb. 1996.
  • Topic: distributed OS, distributed file systems, reliability,
    availability

36
Experiment 2
  • Identifying clusters of cooperating research groups.
  • Seed: E. M. Chaves, T. J. LeBlanc, B. D. Marsh, M. L. Scott.
    Kernel-kernel communication in a Shared-Memory multiprocessor.
    Symposium on Experiences with Distributed and Multiprocessor
    Systems.

37
Summary-Conclusions
  • Searching needs good tools
  • Researching needs even better
  • USEwebNET combines
  • the information wealth of the web
  • with the user-interface of USENET
  • http://cuckoo.ics.forth.gr:9002

38
Summary-Conclusions
  • Paperfinder keeps scientists up-to-date in their
    field
  • Paperfinder combines
  • known keyword-based query mode
  • Resource-discovery mode
  • memory to remember papers read
  • friendly USENET-like interface

39
Summary-Conclusions
  • USEwebNET - PaperFinder
  • Popular user-interface of USENET News
  • What's new?
  • Hide known information
  • Offload busy web servers
  • PaperFinder
  • Filtering: presents most interesting first
  • The filter evolves
  • as new knowledge becomes available

40
Related Work (USEwebNET)
  • ARCHIE (ftp), Veronica (gopher)
  • Netfind (Colorado)
  • Search engines (Alta-Vista, ...)
  • Meta-searching (MetaCrawler)
  • The Informant (Dartmouth)
  • SenseMaker (Stanford)
  • Letizia (MIT)

41
Related Work (PaperFinder)
  • CiteSeer (NEC)
  • Dienst
  • ACM