Implementing a Web Crawler and Building a Web Index

1
Implementing a Web Crawler and Building a Web Index
  • Group 7
  • Sope Olorunfemi-Bork
  • Ravi Patel
  • Mani Aminian
  • Roma Kifle
  • Kira James - Presenter

2
Objective
  • Implement a web crawler that explores the York
    University web site to find documents and their
    contents
  • Build a web index that processes all documents
    retrieved by the crawler and builds a two-level
    index for the documents

3
What we did differently
  • Started with the file system, BUT moved to a
    database because
  • retrieval of information for a search engine is
    more efficient with a database
  • Update and Insert capabilities solve the problem
    of reading from and writing to a file structure
  • a database is specifically designed, implemented
    and optimized for data storage

4
Data Structures Used
  • 3 vectors for temporary storage: Searched, To
    search, Vector matches
  • 4 main database tables for index storage: Keywords
    (with SortedKeyword), URLs, IndexMatrix (with
    SortedMatrix), Positions
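
The two vectors named Searched and To search suggest the usual crawl loop: take a pending URL, mark it as visited, and queue any links not seen before. The sketch below illustrates that loop only; it is not the group's WebCrawler.java. The class name, the method name and the in-memory link map (a stand-in for fetching and parsing real pages) are assumptions made for the example, and the role of the third vector is not stated on the slide, so it is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Vector;

// Hypothetical sketch: class, method and link map are illustrative only
// and do not come from the original WebCrawler.java.
class CrawlFrontierSketch {

    // Visits pages breadth-first, starting from startUrl.
    // links maps each URL to the URLs found on that page; it stands in
    // for downloading a page and extracting its links.
    static List<String> crawl(String startUrl, Map<String, List<String>> links) {
        Vector<String> toSearch = new Vector<>(); // URLs waiting to be visited
        Vector<String> searched = new Vector<>(); // URLs already visited
        toSearch.add(startUrl);

        while (!toSearch.isEmpty()) {
            String url = toSearch.remove(0);      // oldest pending URL first
            if (searched.contains(url)) {
                continue;                         // never visit a page twice
            }
            searched.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                // Queue a link only if it is neither visited nor pending.
                if (!searched.contains(next) && !toSearch.contains(next)) {
                    toSearch.add(next);
                }
            }
        }
        return new ArrayList<>(searched);         // pages in visiting order
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of("d"));
        System.out.println(crawl("a", links));    // prints [a, b, c, d]
    }
}
```

Vector.contains is a linear scan, so the duplicate checks get slower as the crawl grows; a hash set would make them constant time on average. Vectors are kept here only to mirror the slide.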

5
Design
  • MDB file created by the user
  • ODBC data source set up to connect the MDB file
    and WebCrawler.java
  • c = DriverManager.getConnection("jdbc:odbc:Crawler Index")
  • Query statements in WebCrawler.java create the
    tables
  • While the crawler is running, the indexer does the
    following

6
Indexer
  • Indexes keywords, their positions and the URLs of
    the documents
  • Places them in the Keywords, Positions and URLs
    tables respectively
  • Creates entries in the IndexMatrix table that
    link the corresponding entries in all three
    tables, along with each keyword's frequency of
    occurrence in the URL document
  • Creates SortedKeyword and SortedMatrix for easy
    retrieval by a search engine
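
The bookkeeping described above can be sketched in memory: for each keyword and each URL, record the word positions, and read the frequency off as the number of positions. This is an illustration under assumptions, not the group's indexer. The class and method names are hypothetical, nested maps stand in for the database tables, and the tokenization rule (lower-casing and splitting on non-alphanumeric characters) is chosen only for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: nested maps stand in for the Keywords, URLs,
// Positions and IndexMatrix tables; none of these names or methods
// come from the original code.
class IndexerSketch {

    // keyword -> (URL -> word positions of that keyword in the document).
    // TreeMap keeps keywords in sorted order, loosely mirroring the idea
    // of a sorted keyword table.
    private final Map<String, Map<String, List<Integer>>> index = new TreeMap<>();

    // Splits the text into lower-case words and records where each occurs.
    void indexDocument(String url, String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        int position = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            index.computeIfAbsent(word, k -> new TreeMap<>())
                 .computeIfAbsent(url, k -> new ArrayList<>())
                 .add(position);
            position++;
        }
    }

    // Word positions of keyword in the document at url (empty if none).
    List<Integer> positions(String keyword, String url) {
        return index.getOrDefault(keyword, Map.of()).getOrDefault(url, List.of());
    }

    // Frequency of occurrence is simply the number of recorded positions.
    int frequency(String keyword, String url) {
        return positions(keyword, url).size();
    }

    // All indexed keywords in alphabetical order.
    List<String> sortedKeywords() {
        return new ArrayList<>(index.keySet());
    }

    public static void main(String[] args) {
        IndexerSketch idx = new IndexerSketch();
        idx.indexDocument("u1", "Web crawler builds a web index");
        System.out.println(idx.positions("web", "u1")); // prints [0, 4]
        System.out.println(idx.frequency("web", "u1")); // prints 2
    }
}
```

In a database-backed version, each (keyword, URL) pair would become one row of a matrix table holding the frequency, with the positions stored in a separate table keyed by that row.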

7
Results - Tables

8
Results - Tables

9
Results - Crawler Applet

10
Program Complexity
  • Complexity of the Indexer
  • To READ contents from each page: O(n^2)
  • To STORE keywords, URLs, counts and positions:
    O(n)

11
Complexity Chart

12
Thank You!
  • for talking at the back of the lecture hall while
    I presented