OverCite: A cooperative Digital Research Library

Provided by: GaoQ6
1
OverCite: A cooperative Digital Research Library
  • GaoQian
  • 2005-4-13

2
CiteSeer
  • CiteSeer is the premier repository of
    scientific papers for the computer science
    community. It supports
  • Keyword search
  • Ranking papers and authors
  • Identifying similarity among papers

3
OverCite
  • OverCite is a proposal to move CiteSeer to
    a new architecture: a distributed and
    cooperative research library based on a
    distributed hash table (DHT). It will
  • 1) harness resources at many sites
  • 2) support new features such as document alerts
    and scale to larger data sets

4
Emphasis is not on the novelty of the design,
but on its benefits.
5
OverCite Architecture
  • DHT process
  • Index server
  • Web crawler
  • Web-based front-end

6
Architecture
  • DHT process
  • OverCite nodes participate in a DHT to robustly
    store documents and metadata.
  • Runs on a few hundred stable nodes
  • Each node can keep a full routing table and
    provide one-hop lookups
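
A minimal sketch of what a one-hop lookup with a full routing table could look like. The slide does not say how keys are mapped to nodes, so the SHA-1 identifier ring, the successor rule, and the node names below are all assumptions made for illustration.

```python
import bisect
import hashlib


def key_id(data: bytes) -> int:
    """Map bytes to a 160-bit identifier (SHA-1 is an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


class FullRoutingTable:
    """Every node knows every other node, so a lookup needs no intermediate hops."""

    def __init__(self, node_names):
        # Sort the nodes by their position on the identifier ring.
        self.ring = sorted((key_id(name.encode()), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]

    def lookup(self, key: bytes) -> str:
        """Return the node responsible for key: its successor on the ring."""
        idx = bisect.bisect_left(self.ids, key_id(key)) % len(self.ring)
        return self.ring[idx][1]


# Hypothetical deployment of 200 stable nodes.
node_names = [f"node{i}.example.org" for i in range(200)]
table = FullRoutingTable(node_names)
owner = table.lookup(b"contents of some paper")
print(owner)
```

Because the table is complete, the lookup is a local binary search followed by a single message to the owner, which is only practical when the node set is small and stable.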

7
Architecture
  • Index server
  • OverCite partitions the inverted index by
    document into k index partitions.
  • Only k servers are involved in each query
  • Each of the k servers uses about 1/k of the
    CPU time required to search a single full-size
    inverted index.
  • Explore other optimizations
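
The partitioning described above can be sketched as follows. This is an illustration only: assigning documents to partitions by `doc_id % k`, ranking by term count, and the sample documents are assumptions, not details from the slides.

```python
import heapq
from collections import defaultdict

K = 4  # number of index partitions (hypothetical value)


def partition_of(doc_id: int, k: int = K) -> int:
    """Assign a document to one partition (modulo rule is an assumption)."""
    return doc_id % k


class IndexPartition:
    """Inverted index over only the documents assigned to this partition."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}

    def add(self, doc_id: int, text: str) -> None:
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, term: str, m: int):
        """Return up to m (score, doc_id) pairs, best first."""
        hits = self.postings.get(term.lower(), {})
        return heapq.nlargest(m, ((score, doc_id) for doc_id, score in hits.items()))


def build(docs: dict, k: int = K):
    parts = [IndexPartition() for _ in range(k)]
    for doc_id, text in docs.items():
        parts[partition_of(doc_id, k)].add(doc_id, text)
    return parts


def query(parts, term: str, m: int):
    """Ask one server per partition, then merge the k partial result lists."""
    merged = [hit for part in parts for hit in part.search(term, m)]
    return heapq.nlargest(m, merged)


docs = {1: "distributed hash table", 2: "hash hash index",
        3: "web crawler", 6: "hash table table"}
parts = build(docs)
results = query(parts, "hash", 2)
print(results)
```

Each partition scans postings for roughly 1/k of the documents, which is where the per-server CPU saving comes from; the cost is that every query touches k servers.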

8
Architecture
  • Web crawler
  • Nodes coordinate the crawling effort via a list
    of to-be-crawled page URLs stored in the DHT.
  • Future work.

9
Architecture
  • Web-based front-end
  • A subset of OverCite nodes run the Web front-end.
  • Round-robin DNS spreads clients across them

10
CiteSeer Tables
DID (document ID), CID (citation ID), GID (group ID)
11
OverCite Tables (not explicitly distinct entities)
FID (a hash of the file contents). Each node also
stores its partition of the inverted index locally.
12
Calculations
  • Maintenance Resources
  • Query Resources
  • User Delay

13
Calculations ---- Maintenance Resources
  • Bandwidth Overhead
  • Per document: f·x + (n/k)·(x/10) bytes
  • f: overhead factor due to storage
    redundancy in the DHT
  • x: average original file size
  • x/10: average size of the text file
  • Storage
  • Each node: (d + e)·f/n + i/k GB
    ? (d + e)(f 1)/n + i/k
  • d: amount of storage used for documents
  • e: amount of storage used for meta-data tables
  • i: total index size
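
A small calculator for the two maintenance formulas, as a sketch. It assumes the terms are summed (per document: f·x for the redundant DHT insert plus (n/k)·(x/10) for sending the text to the index servers; per node: (d + e)·f/n plus i/k). Every numeric value below is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def bandwidth_per_document(f: float, x: float, n: int, k: int) -> float:
    """Bytes sent per new document: DHT insert plus text sent to n/k index nodes."""
    return f * x + (n / k) * (x / 10)


def storage_per_node(d: float, e: float, f: float, i: float, n: int, k: int) -> float:
    """GB stored per node: share of replicated DHT data plus one index partition."""
    return (d + e) * f / n + i / k


# Hypothetical inputs: redundancy factor 2, 1 MB files, 100 nodes, 20 partitions.
bw = bandwidth_per_document(f=2, x=1_000_000, n=100, k=20)
# Hypothetical inputs: 800 GB documents, 100 GB meta-data, 20 GB total index.
gb = storage_per_node(d=800, e=100, f=2, i=20, n=100, k=20)
print(bw, gb)
```

With these made-up inputs the per-document cost is 2.5 MB and each node stores 19 GB; the point is the shape of the formulas, not the numbers.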

14
Calculations ---- Query Resources
15
Calculations ---- Query Resources
  • If there are n participating nodes, each DID is
    20 bytes, and the context and rank metric value
    together are 50 bytes, each query consumes about
    70mk bytes of traffic.
  • Assuming 250,000 searches per day, k = 20, and
    returning m = 10 results per query per node, our
    query design adds 3.5 GB of traffic per day to
    the network (or 35 MB per node).
  • This is a reasonably small fraction of the traffic
    currently served by CiteSeer (34.4 GB).
  • This does not include the meta-data lookup
    traffic for the top b matches, which is much
    smaller (a reasonable value for b is 10 or 20).
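
The arithmetic behind the 3.5 GB figure can be checked with a few lines. The byte sizes and daily search count come from the slide; n = 100 is an assumption inferred from the 35 MB per-node figure, since the slide does not state n.

```python
DID_BYTES = 20               # size of one document ID (from the slide)
CONTEXT_AND_RANK_BYTES = 50  # context plus rank metric value (from the slide)


def query_bytes(m: int, k: int) -> int:
    """Traffic for one query: m results of 70 bytes from each of k partitions."""
    return (DID_BYTES + CONTEXT_AND_RANK_BYTES) * m * k


searches_per_day = 250_000
daily_bytes = searches_per_day * query_bytes(m=10, k=20)
daily_gb = daily_bytes / 1e9

n = 100  # assumption: implied by "35 MB per node", not stated on the slide
per_node_mb = daily_bytes / n / 1e6

fraction_of_citeseer = daily_gb / 34.4
print(daily_gb, per_node_mb, round(fraction_of_citeseer, 3))
```

The added query traffic works out to roughly a tenth of the 34.4 GB that CiteSeer is said to serve per day.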

16
Question
  • 1) [...] total traffic, document traffic:
    34.4 GB, 21 GB, 14.4 GB
  • 2) [...] 14.4 GB [...] server [...] user [...] servers [...]
    overhead [...] 14.4 GB [...]
  • 3) meta-data [...]
  • 4) Each node caches the meta-data for the
    documents in its index partition. [...] DHT [...]

17
Calculation ----User Delay
  • Multiple DHT lookups can be parallelized
  • Search response time
  • twice the average round-trip time of the network
    + computation time
  • Generating a page about a document (its citations
    and what cites it)
  • additional delay for looking up extra meta-data
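
A rough model of the delay estimate above, as a sketch. It assumes the search response time is the sum of two average round trips and the computation time, and that parallel DHT lookups cost as much as the slowest one rather than the total. All latencies below are hypothetical.

```python
def search_delay_ms(avg_rtt_ms: float, compute_ms: float) -> float:
    """Search response time: two network round trips plus computation."""
    return 2 * avg_rtt_ms + compute_ms


def lookup_delay_ms(latencies_ms, parallel: bool = True) -> float:
    """Delay of several DHT lookups, issued in parallel or one after another."""
    return max(latencies_ms) if parallel else sum(latencies_ms)


# Hypothetical numbers: 80 ms average RTT, 50 ms of index search time.
search = search_delay_ms(avg_rtt_ms=80, compute_ms=50)

# Hypothetical meta-data lookups for a document page.
lookups = [70, 95, 60, 85]
page_parallel = search + lookup_delay_ms(lookups, parallel=True)
page_serial = search + lookup_delay_ms(lookups, parallel=False)
print(search, page_parallel, page_serial)
```

Under these made-up latencies, parallel lookups add 95 ms to the page while sequential ones would add 310 ms, which is why parallelizing the lookups matters for the document page.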

18
Features
  • Document Alerts (SmartSeer: continuous queries
    over CiteSeer)
  • Document Recommendations (like Amazon)
  • Plagiarism Checking
  • More documents

19
  • Thanks!