Scraping Data from Websites with the .NET Framework

New England Code Camp IV: 'Developer's Gone Wild'

1
Scraping Data from Websites with the .NET
Framework
  • Phil Denoncourt III
  • Denoncourt Associates
  • http://www.allphil.com/blog

2
Imagine
  • All the information available on the Internet
  • Stock Quotes
  • Box scores
  • Names/Addresses
  • Mapping
  • Pricing Information
  • Internal Applications
  • Imagine you could integrate any piece of
    information on the web into your application

3
Web Scraping
  • Automated browsing
  • HTML is retrieved
  • Data is extracted from the HTML
  • Extracted data is stored for later analysis

4
What would possess you?
  • Web services and RSS make this obsolete
  • Sometimes the only way to get info
  • Closed Systems
  • No exposed webservice
  • Application was intended to be standalone
  • No direct access to datastore
  • Unit testing
  • Cool factor

5
Problems
  • Legal: Might be violating Terms of Use
  • Very dependent on the HTML of the site
  • Small UI Changes break your code
  • Sometimes tricky
  • Mimicking Cookies
  • Maintaining Session
  • Authentication

6
Methodology
  • What data are you harvesting?
  • Come up with a plan
  • How will you parse it?
  • Where will you store the data?
  • How will you update the data?

7
Components
  • HTML Retrieval
  • HTML Parsing
  • Data Repository
  • Comparison Engine
  • Updates
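
The Comparison Engine and Updates components can be sketched as a diff between what the data repository already holds and what a fresh scrape returned. This is a minimal illustration in Python rather than .NET. The function name `diff_snapshots`, the dict-of-records shape, and the sample values are all assumptions made for the example, not something taken from the slides.

```python
def diff_snapshots(stored, scraped):
    """Compare stored records with freshly scraped ones.

    Both arguments are dicts mapping a record key to its value.
    Returns (added, changed, removed) so the caller can decide
    which updates to apply to the repository.
    """
    added = {k: v for k, v in scraped.items() if k not in stored}
    removed = {k: v for k, v in stored.items() if k not in scraped}
    changed = {
        k: (stored[k], v)  # (old value, new value)
        for k, v in scraped.items()
        if k in stored and stored[k] != v
    }
    return added, changed, removed


# Made-up sample data: three hypothetical price records.
stored = {"AAA": "27.10", "BBB": "82.50"}
scraped = {"AAA": "27.35", "BBB": "82.50", "CCC": "4.10"}
added, changed, removed = diff_snapshots(stored, scraped)
```

Keeping the comparison separate from retrieval and parsing means a site redesign should only affect the parsing step, not the update logic.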

8
HTML Retrieval
  • Complications
  • Site might depend on cookies
  • Site might require specific credentials
  • Server might have Robot prevention software
  • Firewalls

9
HTML Retrieval Methods

10
System.Net.HttpWebRequest
  • Easy to use
  • Allows you to read HTML as a stream
  • Handles authentication
  • Have to code for firewalls
  • Cookies can be tricky
  • Result: string of HTML
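
The request-based approach above can be sketched as follows. This uses Python's standard `urllib.request` in place of the .NET class, so it illustrates the pattern (one request, a shared cookie store, HTML back as a string) and not the .NET API. The helper names `make_opener` and `fetch_html` and the `User-Agent` value are assumptions made for the example.

```python
import http.cookiejar
import urllib.request


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_html(opener, url, timeout=10):
    """Retrieve one page and return its HTML as a string."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}  # hypothetical value
    )
    with opener.open(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Reusing the same opener for a login request and the pages that follow is what lets the cookie jar carry the session from one call to the next.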

11
IE Browser Control
  • Add IE as ActiveX Control
  • Code controls the browser
  • Handles authentication, cookies, and firewalls
  • Cannot deal with badly formed HTML
  • Easy to use
  • Result: Document Object Model (DOM)

12
IPersistStream
  • Still use IE as ActiveX control
  • Retrieve raw HTML using IPersistStream interface
  • Same advantages as IE Browser Control
  • Difficult to implement
  • Ask for Phil's wrapper DLL
  • Result: string of HTML

13
Parsing

14
Challenges in Parsing
  • HTML is usually sloppy
  • Most sites have badly formed HTML
  • A change in site design will change the parsing
    strategy
  • Parsing is CPU intensive

15
Strings of HTML
  • Relatively slow performing
  • .NET string class not geared for parsing
  • IndexOf, Substring, LastIndexOf
  • Case sensitivity
  • Code is likely to be reliant on site structure
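
A minimal sketch of the string-search approach, written in Python rather than .NET. Python's `str.find` and slicing play the role of IndexOf and Substring here. The helper name `extract_between` and the sample markup are assumptions made for the example.

```python
def extract_between(html, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing.

    Searching is done on a lowercased copy so <TD> and <td> match alike,
    while the returned text keeps its original casing. This assumes
    lowercasing does not change the string length, which holds for
    ASCII markup.
    """
    lowered = html.lower()
    start = lowered.find(start_marker.lower())
    if start == -1:
        return None
    start += len(start_marker)
    end = lowered.find(end_marker.lower(), start)
    if end == -1:
        return None
    return html[start:end].strip()


# Made-up sample markup with upper-case tags.
sample = '<TR><TD class="price"> 19.99 </TD></TR>'
price = extract_between(sample, '<td class="price">', '</td>')
```

The start marker has to match the page's markup character for character apart from case, which is why this style of code tends to break on small site changes.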

16
Regular Expressions
  • More powerful than string parsing
  • Tend to be faster than string parsing
  • Syntax is awkward
  • My personal bias
  • Can be set up to be less reliant on site structure
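
As an illustration of a pattern that is less tied to exact markup, here is a sketch using Python's `re` module (the slides target .NET, but the pattern syntax carries over largely unchanged). The pattern name `PRICE_CELL`, the `price` class, and the sample markup are assumptions made for the example.

```python
import re

# Matches a <td> whose class attribute starts with "price", tolerating
# other attributes, optional quotes, extra whitespace, and tag casing.
PRICE_CELL = re.compile(
    r'<td\b[^>]*\bclass\s*=\s*["\']?price["\']?[^>]*>\s*(.*?)\s*</td\s*>',
    re.IGNORECASE | re.DOTALL,
)


def extract_prices(html):
    """Return the trimmed contents of every matching cell, in page order."""
    return PRICE_CELL.findall(html)


# Made-up sample markup with inconsistent casing and quoting.
sample = (
    '<TD align=right CLASS="price"> 19.99 </TD>'
    '<td class=price>5.00</td>'
    '<td class="name">Widget</td>'
)
prices = extract_prices(sample)
```

Because the pattern ignores attribute order and casing, it should survive cosmetic edits to the page that would break an exact string search.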

17
Html Agility Pack
  • Can download HTML files
  • Uses WebRequest class
  • Parses HTML into a hierarchical XML document
  • Allows for badly formed HTML
  • Allows you to query for data using XPath
  • Reduces your reliance on site structure
  • Map out approach with Outline View
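
The idea of a parser that tolerates badly formed HTML can be sketched with Python's standard `html.parser` module. This is not the Html Agility Pack and offers no XPath queries. It only shows how a forgiving parser lets you pull data from markup with unclosed tags. The class name `CellCollector`, the helper `extract_cells`, and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []
        self._buffer = None  # text chunks while inside a matching cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()  # a new <td> implicitly ends an unclosed one
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep a cell left open at end of input

    def _flush(self):
        if self._buffer is not None:
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None


def extract_cells(html, css_class):
    parser = CellCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up sample: no closing </td> or </tr>, mixed casing and quoting.
sloppy = ('<table><TR><TD class=price>19.99<td class="name">Widget'
          '<TD CLASS="price big">5.00</table>')
cells = extract_cells(sloppy, "price")
```

Selecting by tag and class rather than by position in the page is what reduces the dependence on site structure.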

18
Demos

19
Resources
  • Google
  • Web Scraping .NET
  • Phil's IPersistStream DLL
  • www.denoncourtassociates.com
  • HTML Agility Pack
  • http://smourier.blogspot.com/
  • CodeProject.com
  • Screen Scraping with C# for ASP.NET