Design Of Web Crawler and Web Indexer

1
Design Of Web Crawler and Web Indexer
Group 14
Pradeep Kanagaratnam 205237250
Peter Pun 863383855
William, Yip Sai Yeung 205145859
Farzana Khan 204924049
York University, Information Technology 4020M
Professor Jimmy Huang
2
Web Crawling
  • Initializing the methods
  • Start the crawling
  • Download the URLs
  • Find new URL links
  • Add the list of URL pages to the queue
  • Get words from the source code of each URL
  • In the main method, call the crawled URLs list and
    words

3
Finding New URL links
  • Finding new URL links within the source code.
  • For example, <A HREF="http://www.yorku.ca/myfile.html">
    is a link.
  • As soon as we find the characters "<a" in the HTML
    source code, we look for the closing ">"
    character.
  • Within those two markers, we look for
    "href". Then we look for the quotation marks on both
    sides of the URL.
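
The scan described above can be sketched in Java roughly as follows. This is a minimal illustration, not the group's actual code: the class name LinkFinder and both method names are hypothetical, and it assumes every href value is wrapped in double quotes. It also matches any tag that begins with "<a", so a real crawler would need a stricter check.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkFinder {
    // Case-insensitive indexOf, so <A HREF=...> and <a href=...> both match.
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the double-quoted href values found inside anchor tags. */
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = indexOfIgnoreCase(html, "<a", pos);   // start of a tag
            if (open < 0) break;
            int close = html.indexOf('>', open);             // end of that tag
            if (close < 0) break;
            int href = indexOfIgnoreCase(html, "href", open);
            if (href >= 0 && href < close) {                 // href belongs to this tag
                int q1 = html.indexOf('"', href);            // opening quotation mark
                int q2 = (q1 >= 0) ? html.indexOf('"', q1 + 1) : -1; // closing one
                if (q1 >= 0 && q2 > q1 && q2 < close) {
                    links.add(html.substring(q1 + 1, q2));
                }
            }
            pos = close + 1;                                 // continue after this tag
        }
        return links;
    }
}
```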

4
Crawling detail
  • http://www.yorku.ca/dforster/course/2620/index.html#announcements
  • The # sign means that the link points to a location within the same page. If
    it is found within a link, we cut off the string
    following it.
  • if (u.length() >= 1 && u.substring(0, 1).equals("#"))
  •     return false;
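
As an illustration of the two cases above (a link that is only a fragment, and a link with a fragment appended), here is a small sketch. The class and method names are hypothetical and not taken from the slides.

```java
final class FragmentFilter {
    /** True when the link is only a fragment such as "#top", i.e. the same page. */
    static boolean isSamePageLink(String u) {
        return u.startsWith("#");
    }

    /** Cuts off the '#' and everything after it; returns the input if there is none. */
    static String stripFragment(String u) {
        int hash = u.indexOf('#');
        return hash < 0 ? u : u.substring(0, hash);
    }
}
```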

5
Adding to Queue
  • Then the URL is submitted to the queue, which
    contains a list of URLs to be visited.
  • Before a URL is added:
  • First we check whether the URL is already queued; we
    do not queue duplicate URLs.
  • Some links lack the www prefix and domain name; if
    so, we add them.
  • We do not queue URLs that are outside the
    domain we are crawling.
  • We also check whether the page exists by
    looking at the HttpURLConnection response code.
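
The duplicate check can be sketched with a queue plus a set of every URL seen so far. This is one possible arrangement under assumed names (CrawlQueue, offer, next); the slides do not say which data structures the group used.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

final class CrawlQueue {
    private final Queue<String> pending = new ArrayDeque<>(); // URLs still to visit
    private final Set<String> seen = new HashSet<>();         // every URL ever queued

    /** Queues the URL unless it was queued before; returns true if it was added. */
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // duplicate: already queued (or already visited)
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to visit, or null when the queue is empty. */
    String next() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }
}
```

Keeping the seen set separate from the queue means a URL that has already been visited and removed is still rejected when it turns up again.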

6
Adding to Queue (Explanations)
  • Sometimes, inside the HTML source code, we see
    that a link starts with <a href="/registrar/entrol.html">,
    which means it is within the same domain
    name.
  • If the link does not start with http://www.,
    then we prepend the website address (http://www.yorku.ca/)
    to the specific webpage address.
  • For example, "/registrar/entrol.htm" becomes
    http://www.yorku.ca/registrar/entrol.htm
  • if (url.length() < 4 || url.substring(0, 4).indexOf("http") == -1)
  •     url = websiteAddress + url;
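
Plain concatenation of "http://www.yorku.ca/" and "/registrar/entrol.htm" leaves a double slash in the result. As an alternative to the string approach on this slide (not the group's method), java.net.URI can resolve a relative link against the site address. The class and method names below are hypothetical.

```java
import java.net.URI;

final class UrlResolver {
    /**
     * Resolves a link found on a page against the website address.
     * Absolute links are returned unchanged; relative ones are completed.
     * URI.create throws IllegalArgumentException for malformed input.
     */
    static String toAbsolute(String websiteAddress, String link) {
        return URI.create(websiteAddress).resolve(link).toString();
    }
}
```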

7
Adding to Queue (Explanations)
  • Checking whether the URL links are within the
    same domain name.
  • If it is not a yorku.ca URL link, then we do not
    add it to the queue.
  • if (STAY_IN_DOMAIN && getDomain(url).toLowerCase()
    .indexOf(getDomain(websiteAddress)) == -1) {
  •     System.err.println("\tOUTSIDE OF DOMAIN " + url);
  •     return false;
  • }
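
The slides do not show getDomain, so the version below is an assumption about what it might do: take the host name and drop a leading "www.". The comparison is also stricter than the indexOf test on the slide, because it accepts only the site's own host or one of its subdomains.

```java
import java.net.URI;

final class DomainFilter {
    /** Hypothetical stand-in for the slides' getDomain: host without "www.". */
    static String getDomain(String url) {
        String host = URI.create(url).getHost(); // null when the URL has no host
        if (host == null) {
            return "";
        }
        host = host.toLowerCase();
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    /** True when url is on the site's domain or on one of its subdomains. */
    static boolean inDomain(String url, String websiteAddress) {
        String domain = getDomain(url);
        String site = getDomain(websiteAddress);
        return !site.isEmpty()
                && (domain.equals(site) || domain.endsWith("." + site));
    }
}
```

With a plain substring test, a host such as notyorku.ca would pass as a yorku.ca link; the suffix check with a leading dot avoids that.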

8
Adding to Queue (Explanations)
  • Checking whether the page exists
  • A missing page may still respond by redirecting to a
    "page not found" page, so we inspect the response code.
  • try {
  •     myURL = new URL(url);
  •     HttpURLConnection conn = (HttpURLConnection)
        myURL.openConnection();
  •     conn.connect();
  •     rc = conn.getResponseCode();
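
A fuller version of this check might look like the sketch below. It rests on several assumptions that are not in the slides: redirects are not followed (so a redirect to a "page not found" page is reported as a 3xx code rather than 200), only code 200 counts as "exists", a HEAD request is used, and the timeouts are 5 seconds. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class PageChecker {
    /** Assumption: only 200 (HTTP_OK) means the page really exists. */
    static boolean isOk(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_OK;
    }

    /** Returns true when the server answers the URL with 200 and no redirect. */
    static boolean pageExists(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setInstanceFollowRedirects(false); // report 3xx instead of following it
            conn.setRequestMethod("HEAD");          // headers only, no page body
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int rc = conn.getResponseCode();        // connects implicitly
            conn.disconnect();
            return isOk(rc);
        } catch (IOException | ClassCastException e) {
            // Malformed URL, network failure, or a non-HTTP URL such as file:
            return false;
        }
    }
}
```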

9
Extract Contents
  • Scan the page character by character
  • If < is found, we skip until we find >
  • For example: <br>, <table>, <a href>, <img src>
  • The resulting parsed page is stored in a String.
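
A minimal sketch of this character-by-character scan follows. The class and method names are hypothetical, and replacing each tag with a space (so that words on either side of a tag do not run together) is an assumption rather than something the slides state.

```java
final class TagStripper {
    /** Copies everything outside <...> into the result; each tag becomes a space. */
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;              // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;             // tag finished
                text.append(' ');          // keep neighbouring words apart
            } else if (!inTag) {
                text.append(c);            // ordinary page text
            }
        }
        return text.toString();
    }
}
```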

10
WriteIndex( )
  • WriteIndex( PageString, URL )
  • PageString contains the above mentioned page as a
    String
  • URL is the path to the html file

11
WriteIndex ( )
  • Use StringTokenizer to extract words by
    specifying separator characters
  • StringTokenizer token = new StringTokenizer(urlPage,
    " .,\t\n!-()\"\\'");
  • Separator chars include
  • Space, Comma, Brackets, Curly brackets, ampersand
    &, etc.
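
To show how the tokenizer could feed a per-page word count, here is a sketch that reuses the separator string exactly as it appears on the slide (the slide's full list may have been longer). The class name, the lower-casing of words, and the use of a map are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

final class WordCounter {
    // Separator characters as printed on the slide.
    static final String SEPARATORS = " .,\t\n!-()\"\\'";

    /** Splits the page text into words and counts how often each one occurs. */
    static Map<String, Integer> countWords(String pageText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer token = new StringTokenizer(pageText, SEPARATORS);
        while (token.hasMoreTokens()) {
            String word = token.nextToken().toLowerCase(); // assumption: case-insensitive index
            counts.merge(word, 1, Integer::sum);           // add 1 to the word's count
        }
        return counts;
    }
}
```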

12
WriteIndex ( )
  • HTML uses the string &nbsp; to represent a
    non-breaking space
  • The & character was already removed by
    StringTokenizer, so now we remove the leftover nbsp
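
One way to drop the leftover token is sketched below. It assumes that both & and ; are among the separator characters, so that "&nbsp;" reaches the word loop as the bare token nbsp. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class NbspFilter {
    /** Tokenizes the text and skips the "nbsp" left over from "&nbsp;". */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // Assumption: '&' and ';' are separators, so "&nbsp;" yields the token "nbsp".
        StringTokenizer token = new StringTokenizer(text, " &;");
        while (token.hasMoreTokens()) {
            String word = token.nextToken();
            if (word.equalsIgnoreCase("nbsp")) {
                continue; // entity residue, not a real word
            }
            result.add(word);
        }
        return result;
    }
}
```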

13
WriteIndex( )
  • Uses an embedded SQL syntax called SQLJ
  • First connect to Oracle with a username and
    password
  • Oracle.connect(Index.class, "connect.properties");
  • The file connect.properties is located in the
    same folder as the source code.
  • It includes the connection URL for Oracle, plus the user
    name and password.
  • Every time the program needs to insert data into or delete
    data from the database, it needs to connect to Oracle with
    the Oracle.connect statement.

14
WriteIndex( )
  • Insert method to insert data into SQL.
  • #sql { insert into index_web values
    (:words, :count, :url) };
  • Above is the statement that inserts the data
    into the Oracle table called index_web.

15
WriteIndex( )
  • Commit and close the connection.
  • After an INSERT, DELETE, or UPDATE we require a
    COMMIT statement
  • Always use Oracle.close() to disconnect.
  • #sql { COMMIT };
  • Oracle.close();

16
  • Thank You
  • Questions and Comments!!!