Web Crawler - PowerPoint PPT Presentation

Web Crawler

Slides: 15
Provided by: yxie
Transcript and Presenter's Notes

Title: Web Crawler


1
Web Crawler
  • Dr. Ying Xie

2
Class Node
  • class Node {
        public Node(int n, String s) {
            ID = n;
            urlString = s;
        }
        public int getID() { return ID; }
        public String getURL() { return urlString; }
        private int ID;
        private String urlString;
    }

3
Basic Algorithm
  • while (vectorToSearch.size() > 0 && count < SEARCH_LIMIT) {
        startNode = (Node) vectorToSearch.get(0);
        vectorToSearch.remove(0);
        if (!searchPage(startNode)) return;
    }
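
The loop above treats vectorToSearch as a first-in, first-out queue, so pages are visited breadth-first. A minimal sketch of the same idea follows, under some assumptions that are not on the slide: a hypothetical in-memory map `links` stands in for fetching a page and extracting its URLs, and a `seen` set is added so that a URL is queued only once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FrontierSketch {
    // Visits pages breadth-first from 'seed' and returns them in visit order.
    // 'links' is a hypothetical stand-in for "fetch the page and extract its URLs".
    static List<String> crawl(String seed, Map<String, List<String>> links, int searchLimit) {
        Deque<String> toSearch = new ArrayDeque<>(); // plays the role of vectorToSearch
        Set<String> seen = new HashSet<>();          // assumption: not shown on the slide
        List<String> visited = new ArrayList<>();
        toSearch.addLast(seed);
        seen.add(seed);
        while (!toSearch.isEmpty() && visited.size() < searchLimit) {
            String current = toSearch.removeFirst();
            visited.add(current);
            for (String next : links.getOrDefault(current, List.of())) {
                // Set.add returns false for a URL that was already queued.
                if (seen.add(next)) {
                    toSearch.addLast(next);
                }
            }
        }
        return visited;
    }
}
```

Without the `seen` set, two pages that link to each other would keep re-queuing one another until the search limit is reached.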

4
searchPage
  • public boolean searchPage(Node fromNode)
  • url = new URL(fromNode.getURL());
  • URLConnection urlCon = url.openConnection();
  • urlCon.setAllowUserInteraction(false);
  • BufferedReader in = new BufferedReader(
        new InputStreamReader(urlCon.getInputStream()));
  • String input;
    while ((input = in.readLine()) != null) {
  • //1. analyze the page content
  • //2. extract embedded URLs, create Node objects,
    //   and put them back into vectorToSearch.
  • }

5
//1. Analyze the Page Content
  • if (input.toLowerCase().indexOf("<title") != -1
        && hasTitle == false) {
        hasTitle = true;
        title = parseForTitle(input, in);
        continue;
    }
  • if (input.toLowerCase().indexOf("<meta") != -1
        && input.toLowerCase().indexOf("keywords") != -1
        && hasKeywords == false) {
        hasKeywords = true;
        keywords = parseForKeywords(input, in);
        continue;
    }
  • if (input.toLowerCase().indexOf("<meta") != -1
        && input.toLowerCase().indexOf("description") != -1
        && hasDescription == false) {
        hasDescription = true;
        description = parseForDescription(input, in);
        continue;
    }
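
The three checks above record the first title, keywords, and description found while reading the page line by line. A rough sketch of that scan using regular expressions instead of indexOf is shown below. It assumes each tag sits on a single line, that `name` comes before `content`, and that attribute values are double-quoted; the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageMetaSketch {
    // Assumption: <meta name="..." content="..."> on one line, name before content.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"(keywords|description)\"\\s+content=\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);
    // Assumption: opening and closing title tags are on the same line.
    private static final Pattern TITLE = Pattern.compile(
        "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE);

    // Returns the first title, keywords, and description seen, keyed by lower-case name.
    static Map<String, String> scan(BufferedReader in) {
        Map<String, String> found = new LinkedHashMap<>();
        try {
            String input;
            while ((input = in.readLine()) != null) {
                Matcher t = TITLE.matcher(input);
                if (t.find()) {
                    // putIfAbsent keeps only the first hit, like the hasTitle flag.
                    found.putIfAbsent("title", t.group(1).trim());
                }
                Matcher m = META.matcher(input);
                while (m.find()) {
                    found.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2).trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

Real pages often break these assumptions (reordered attributes, single quotes, tags split across lines), so an HTML parser is usually the more robust choice.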


6
www.kennesaw.edu
  • content="en-us"
  • Kennesaw State University
  • content="text/html; charset=windows-1252"
  • University, located 20 miles north of Atlanta, is
    Northwest Georgia's major university, and is the
    fastest growing school in the University System
    of Georgia."
  • src="mm_menu.js"

7
Kennesaw State University
  • while (foundTag == false) {
        start = j + 1;
        for (k = start; k < rawInput.length(); k++) {
            if (foundTag == false && rawInput.charAt(k) != '<') {
                titleLength++;
                title.append(rawInput.charAt(k));
            } else {
                foundTag = true;
                break;
            }
        }
        if (foundTag == false) {
            rawInput = in.readLine();
            j = -1;
        }
    }
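
The loop above copies characters into the title until it reaches the next tag, reading further lines when the title runs past the end of the current one. A compact sketch of that behavior is given below. The signature mirrors the parseForTitle(input, in) call from the earlier slide, but the body is an assumption, not the original code; wrapped lines are joined with a single space.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

class TitleSketch {
    // Case-insensitive indexOf that does not change the string's length.
    private static int indexOfIgnoreCase(String s, String needle) {
        for (int i = 0; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    // Returns the text between <title> and the next '<', reading more lines if needed.
    static String parseForTitle(String firstLine, BufferedReader in) {
        int open = indexOfIgnoreCase(firstLine, "<title>");
        if (open == -1) return "";
        int start = open + "<title>".length();
        StringBuilder title = new StringBuilder();
        String raw = firstLine;
        try {
            while (raw != null) {
                int end = raw.indexOf('<', start);
                if (end != -1) {
                    title.append(raw, start, end); // found the next tag: done
                    break;
                }
                // No tag on this line: keep the rest and continue on the next line.
                title.append(raw, start, raw.length()).append(' ');
                raw = in.readLine();
                start = 0;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return title.toString().trim();
    }
}
```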

8
//Extract Embedded URLs
  • final String parseForLink(String input) {
  • String upperCaseInput = input.toUpperCase();
    int i, j, k, l;
    String temp = null;
    String link = null;
    i = upperCaseInput.indexOf("HREF");
  • if (i != -1) {
        j = upperCaseInput.indexOf("\"", i);
        k = upperCaseInput.indexOf("\"", j + 1);
        if (j != -1 && k != -1) {
            temp = input.substring(j + 1, k);
            link = temp.trim();
            return (link);
        }
    }
  • return "";
    }
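
parseForLink above returns only the first double-quoted value after HREF on a line. As a variation (not the slide's method), the sketch below uses a regular expression to collect every href value on a line and accepts single or double quotes. The name parseForLinks is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkSketch {
    // href = "value" or href = 'value'; \1 requires the closing quote to match the opening one.
    private static final Pattern HREF = Pattern.compile(
        "href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns every non-empty href value found in one line of HTML, in order.
    static List<String> parseForLinks(String input) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(input);
        while (m.find()) {
            String link = m.group(2).trim();
            if (!link.isEmpty()) {
                links.add(link);
            }
        }
        return links;
    }
}
```

Unquoted attribute values and links that wrap across lines are not handled; both would need an HTML parser or a more careful scanner.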

9
//Validate extracted URL.
  • if (s.indexOf(".wav") != -1) return "";
  • if (s.indexOf(".avi") != -1) return "";
  • .

10
  • try {
        if (fromURL == null) url = new URL(s);
        else url = new URL(fromURL, s);
    }
  • catch (MalformedURLException e) {
        com.setStatus("NOTE it is a malformed url---" + s);
        return "";
    }

11
  • if (url.getProtocol().compareTo("http") != 0) {
        com.setStatus("NOTE not http protocol---" + url.toString());
        return "";
    }

12
  • URLConnection urlCon;
  • InputStream urlStream;
  • try {
        urlCon = url.openConnection();
        urlStream = url.openStream();
  • urlStream.close();
        return url.toString();
    }
  • catch (IOException e) {
        com.setStatus("NOTE can not open the url---" + s);
        return "";
    }
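
The validation steps on the last four slides are: skip media files, resolve the link against the page it came from, keep only http, and confirm the URL can be opened. The sketch below covers the first three steps offline. It swaps in java.net.URI for the slides' java.net.URL, leaves out the connection test, and compares extensions case-insensitively; the class name and the SKIPPED list are illustrative assumptions.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

class UrlFilterSketch {
    // The slides list .wav and .avi and then trail off; extend as needed.
    private static final List<String> SKIPPED = List.of(".wav", ".avi");

    // Returns the absolute http URL as a string, or "" if the link is rejected.
    static String validate(URI fromURL, String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED) {
            if (lower.contains(ext)) return "";
        }
        URI url;
        try {
            // Relative links are resolved against the page they were found on.
            url = (fromURL == null) ? URI.create(s) : fromURL.resolve(s);
        } catch (IllegalArgumentException e) {
            return ""; // malformed link
        }
        if (!"http".equalsIgnoreCase(url.getScheme())) {
            return ""; // mailto:, ftp:, https:, or no scheme at all
        }
        return url.toString();
    }
}
```

As on the slide, https links are rejected here; a crawler for the current web would likely accept both schemes.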

13
Respect Robot Exclusion Protocol
  • public boolean robotSafe(URL url) {
        String strHost = url.getHost();
        String strRobot = "http://" + strHost + "/robots.txt";
        URL urlRobot;
  • try {
        urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        return false;
    }
    try {
  • URLConnection urlRobotCon = urlRobot.openConnection();
        urlRobotCon.setAllowUserInteraction(false);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(urlRobotCon.getInputStream()));
        String input;
        while ((input = in.readLine()) != null) {
            if (input.toLowerCase().indexOf("disallow") != -1)
                return false;
        }
        return true;
    } catch (IOException e) {
        return true;
    }
    }
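
robotSafe above treats a whole host as off limits as soon as its robots.txt contains the word "disallow" anywhere. A somewhat finer-grained sketch is shown below: it reads only the groups addressed to all crawlers ("User-agent: *") and rejects a path only if it starts with one of their Disallow prefixes. This is a simplification, not a full implementation of the protocol: Allow lines and wildcard patterns are ignored, and the robots.txt text is passed in as a string rather than downloaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsSketch {
    // Collects Disallow prefixes from groups that apply to every crawler.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;      // does the current group include "User-agent: *"?
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash != -1) line = line.substring(0, hash); // strip comments
            int colon = line.indexOf(':');
            if (colon == -1) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                applies = lastWasAgent ? (applies || star) : star;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True if 'path' does not start with any prefix disallowed for all crawlers.
    static boolean robotSafe(String path, String robotsTxt) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```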

14
RobotSafe Version
  • public boolean searchPage(Node fromNode)
  • url = new URL(fromNode.getURL());
    if (!robotSafe(url))
  • return true;
  • URLConnection urlCon = url.openConnection();
  • urlCon.setAllowUserInteraction(false);
  • BufferedReader in = new BufferedReader(
        new InputStreamReader(urlCon.getInputStream()));
  • String input;
    while ((input = in.readLine()) != null) {
  • //1. analyze the page content
  • //2. extract embedded URLs, create Node objects,
    //   and put them back into vectorToSearch.
  • }