Introduction to Web Crawling and Regular Expression

1
Introduction to Web Crawling and Regular Expression
  • CSC4170 Web Intelligence and Social Computing
  • Tutorial 1
  • Tutor: Tom Chao Zhou
  • Email: czhou_at_cse.cuhk.edu.hk

2
Outline
  • Course and Tutors Information
  • Introduction to Web Crawling
  • Utilities of a crawler
  • Features of a crawler
  • Architecture of a crawler
  • Introduction to Regular Expression
  • Appendix

3
Course and Tutors Information
  • Course homepage:
  • http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
  • Tutors:
  • Xin Xin
  • Email: xxin_at_cse.cuhk.edu.hk
  • Venue: Room 101
  • Tom (me)
  • Email: czhou_at_cse.cuhk.edu.hk
  • Venue: Room 114A

4
Utilities of a crawler
  • Web crawler, also called a spider.
  • Definition:
  • A Web crawler is a computer program that browses
    the World Wide Web in a methodical, automated
    manner. (Wikipedia)
  • Utilities:
  • Gather pages from the Web.
  • Support a search engine, perform data mining, and
    so on.
  • Objects gathered:
  • Text, video, images, and so on.
  • Link structure.

5
Features of a crawler
  • Must provide:
  • Robustness: resist spider traps
  • Infinitely deep directory structures, e.g.
    http://foo.com/bar/foo/bar/foo/...
  • Pages filled with a large number of characters.
  • Politeness: which pages can be crawled, and which
    cannot
  • Robots exclusion protocol: robots.txt
  • http://blog.sohu.com/robots.txt
  • User-agent: *
  • Disallow: /manage/
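
The politeness check above can be sketched with Python's standard urllib.robotparser module. This is a minimal sketch, not the tutorial's own code: the robots.txt body is a hypothetical two-line file modelled on the directives on this slide, the helper names build_parser and allowed are made up for illustration, and the agent name "tutorial-crawler" is an assumption. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body modelled on the two directives on the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /manage/
"""


def build_parser(robots_body: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "tutorial-crawler") -> bool:
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "http://blog.sohu.com/manage/settings"))
print(allowed(parser, "http://blog.sohu.com/index.html"))
```

A real crawler would download robots.txt once per host and cache the parsed rules instead of re-fetching them for every URL.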

6
Features of a crawler (Contd)
  • Should provide:
  • Distributed
  • Scalable
  • Performance and efficiency
  • Quality
  • Freshness
  • Extensible

7
Architecture of a crawler

8
Architecture of a crawler (Contd)
  • URL Frontier: contains the URLs yet to be fetched
    in the current crawl. At first, a seed set is
    stored in the URL Frontier, and the crawler begins
    by taking a URL from the seed set.
  • DNS: domain name service resolution. Looks up the
    IP address for a domain name.
  • Fetch: generally uses the HTTP protocol to fetch
    the URL.
  • Parse: the page is parsed. Text (and images,
    videos, etc.) and links are extracted.
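
The frontier, fetch, and parse steps above can be sketched as a small loop. This is an illustrative sketch under stated assumptions: PAGES is a hypothetical in-memory "web" that stands in for DNS resolution and HTTP fetching, and the names LinkExtractor, fetch, and crawl are invented for this example. Only the Python standard library is used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Parse step: collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for the DNS and HTTP fetch steps (assumption)."""
    return PAGES.get(url, "")


def crawl(seeds):
    """Breadth-first crawl: the frontier starts as the seed set."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl(["http://example.com/"]))
```

A FIFO queue gives breadth-first order; a production frontier would normally order URLs by priority and per-host politeness instead.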

9
Architecture of a crawler (Contd)
  • Content Seen?: test whether a web page with the
    same content has already been seen at another
    URL. This needs a way to compute a fingerprint
    of a web page.
  • URL Filter:
  • Decide whether the extracted URL should be
    excluded from the frontier (robots.txt).
  • URLs should be normalized (relative links
    resolved to absolute URLs).
  • E.g. on en.wikipedia.org/wiki/Main_Page:
  • <a href="/wiki/Wikipedia:General_disclaimer"
    title="Wikipedia:General disclaimer">Disclaimers</a>
  • Dup URL Elim: the URL is checked for duplicate
    elimination.
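
The three checks on this slide can be sketched as follows. This is a minimal sketch with assumed names (normalize, fingerprint, Deduplicator). The fingerprint here is a SHA-256 hash of whitespace-collapsed, lower-cased text, which is only one possible choice and detects exact duplicates only; near-duplicate detection needs a different technique.

```python
import hashlib
from urllib.parse import urldefrag, urljoin


def normalize(base_url, href):
    """Resolve a relative link against its page URL and drop any #fragment."""
    absolute = urljoin(base_url, href)
    absolute, _fragment = urldefrag(absolute)
    return absolute


def fingerprint(page_text):
    """Hash of the page text after collapsing whitespace and case."""
    collapsed = " ".join(page_text.split()).lower()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers the URLs and content fingerprints seen so far."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_fingerprints = set()

    def is_new_url(self, url):
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_new_content(self, page_text):
        digest = fingerprint(page_text)
        if digest in self.seen_fingerprints:
            return False
        self.seen_fingerprints.add(digest)
        return True


url = normalize(
    "http://en.wikipedia.org/wiki/Main_Page",
    "/wiki/Wikipedia:General_disclaimer",
)
print(url)
```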

10
Architecture of a crawler (Contd)
  • Other issues:
  • Housekeeping tasks
  • Log crawl progress statistics: URLs crawled,
    frontier size, etc. (every few seconds)
  • Checkpointing: a snapshot of the crawler's state
    (the URL frontier) is committed to disk (every
    few hours).
  • Priority of URLs in the URL frontier
  • Change rate.
  • Quality.
  • Politeness
  • Avoid repeated fetch requests to a host within a
    short time span.
  • Otherwise, blocked?
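
The per-host politeness rule can be sketched as a small bookkeeping class. The class name PolitenessGate and the 2-second minimum delay are assumptions chosen for illustration, not values from the tutorial. The current time is passed in explicitly so that the behaviour is easy to follow and test; a real crawler would pass time.monotonic().

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next fetch is allowed."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        """Seconds to wait before this URL's host may be fetched again."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Call right after fetching, to start the host's cool-down."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


gate = PolitenessGate(min_delay_seconds=2.0)
gate.record_fetch("http://foo.com/a", now=100.0)
print(gate.wait_time("http://foo.com/b", now=101.0))  # same host, too soon
print(gate.wait_time("http://bar.com/", now=101.0))   # different host
```

While one host is cooling down, the crawler can keep working on URLs from other hosts, so throughput need not drop.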

11
Regular Expression
  • Usage:
  • Regular expressions provide a concise and
    flexible means for identifying strings of text of
    interest, such as particular characters, words or
    patterns of characters.
  • Today's target:
  • Introduce the basic principles.
  • A tool to verify regular expressions: Regex
    Tester
  • http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx

12
Regular Expression
  • Metacharacters
  • Similar to wildcards in Windows, e.g. *.doc
  • Target: detect email addresses

13
Regular Expression
  • \b stands for the beginning or end of a word.
  • E.g. \bhi\b finds the word "hi" exactly
  • \w matches a letter, digit, or underscore.
  • . matches any character except a newline
  • * the preceding item can be repeated any number
    of times
  • E.g. \bhi\b.*\bLucy\b
  • + the preceding item can be repeated one or more
    times
  • [] matches any one of the characters inside it
  • E.g. \b[aeiou][a-zA-Z]*\b
  • {n} repeat n times
  • {n,} repeat n or more times
  • {n,m} repeat n to m times
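
The metacharacters on this slide can be tried out with Python's re module. The sample strings and variable names below are made up for illustration, and the vowel pattern assumes the * quantifier after the second character class.

```python
import re

# \b word boundary: matches "hi" only as a whole word, not inside "this" or "high".
whole_word = re.findall(r"\bhi\b", "hi there, this is high")

# \w+ : runs of letters, digits, or underscore.
words = re.findall(r"\w+", "web_crawler v2!")

# . and * : "hi", then anything, then "Lucy".
greeting = re.search(r"\bhi\b.*\bLucy\b", "hi, my name is Lucy")

# [] character classes: words that start with a lower-case vowel.
vowel_words = re.findall(r"\b[aeiou][a-zA-Z]*\b", "an old tree in autumn")

# {n,m} : whole numbers with 2 to 4 digits (\d matches a digit).
numbers = re.findall(r"\b\d{2,4}\b", "7 42 2009 123456")

print(whole_word)
print(words)
print(greeting.group(0))
print(vowel_words)
print(numbers)
```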

14
Regular Expression
  • Target: detect email addresses
  • Specifications:
  • A@B
  • A: a combination of English characters a to z, or
    digits, or . or _ or % or + or -
  • B: cse.cuhk.edu.hk or cuhk.edu.hk (English
    characters)
  • Answer:
  • \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
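
A quick way to check an email pattern of this shape is Python's re module. The sketch below assumes the answer reads \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b; the sample addresses (alice, bob, user) are hypothetical. As written, the pattern accepts only lower-case text and a two-letter final label such as "hk", so it fits the stated specification but is not a general email validator.

```python
import re

# Assumed form of the slide's answer: local part, "@", domain, two-letter ending.
EMAIL = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b")

# Hypothetical sample text.
text = (
    "Write to alice@cse.cuhk.edu.hk or bob_01@cuhk.edu.hk; "
    "user@example.com and foo@bar do not fit the specification."
)

found = EMAIL.findall(text)
print(found)
```

Passing re.IGNORECASE to re.compile would also accept upper-case letters, if the specification were relaxed.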

15
Appendix
  • Mercator Crawler:
  • http://mias.uiuc.edu/files/tutorials/mercator.pdf
  • Regular Expression tutorial:
  • http://www.regular-expressions.info/tutorial.html

16
  • Questions?